Chapter 7 dplyr
7.1 Package features
Almost all functions in dplyr take a data frame as their FIRST input and return a value that is also a data frame.
- Taking a data frame as the FIRST input means that:
dplyr::xxx_function(dataSet1, ...) -> result1 # can be expressed as
dataSet1 |> dplyr::xxx_function(...) -> result1
- Returning a data frame means that we can feed the result into another dplyr function without a problem.
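The two properties can be checked on a small example. This is only a sketch: the data frame `toy` and its columns are made up for illustration, and dplyr::filter stands in for xxx_function. It assumes dplyr is installed and R is version 4.1 or later (for the native pipe).

```r
# Toy data, invented purely for illustration
toy <- data.frame(id = 1:4, score = c(10, 25, 40, 55))

# Data frame passed as the first argument ...
dplyr::filter(toy, score > 20) -> result_call

# ... is equivalent to piping the data frame in
toy |> dplyr::filter(score > 20) -> result_pipe

identical(result_call, result_pipe)  # both forms give the same data frame
is.data.frame(result_pipe)           # and the result is again a data frame
```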
It is very common to feed dataSet1 to dplyr::xxx_function and then feed its result to dplyr::yyy_function. Traditionally, this is expressed as:
dplyr::xxx_function(dataSet1, ...) -> result1
dplyr::yyy_function(result1, ...) -> final_result
which is the same as:
dplyr::yyy_function(dplyr::xxx_function(dataSet1, ...), ...) -> final_result
Applying the first-input property, this becomes:
dplyr::xxx_function(dataSet1, ...) |>
  dplyr::yyy_function(...) -> final_result
A function does something to its input. So the following phrase:
- use dataSet1 to create a new feature then use the feature to compute its mean
is often expressed as:
dataSet1 |>
  dplyr function to create a new feature |>
  dplyr function to compute its mean.
7.2 An example
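Before working with the real dataset, the pipeline pattern from Section 7.1 (create a new feature, then compute its mean) can be tried on a toy data frame. This is a minimal sketch: the data frame `toy` and the column names height_cm and height_m are hypothetical, and dplyr::mutate and dplyr::summarise play the roles of the two dplyr functions.

```r
# Toy data, invented for illustration
toy <- data.frame(height_cm = c(150, 160, 170, 180))

toy |>
  dplyr::mutate(height_m = height_cm / 100) |>                      # create a new feature
  dplyr::summarise(mean_height_m = mean(height_m)) -> final_result  # compute its mean

final_result  # a one-row data frame holding the mean
```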
drug <- list()
drug$source[[1]] <-
  "https://docs.google.com/spreadsheets/d/17ID43N3zeXqCvbUrc_MbpgE6dH7BjLm8BHv8DUcpZZ4/edit?usp=sharing"
drug$data <-
  googlesheets4::read_sheet(
    drug$source[[1]]
  )
xfun::download_file("https://raw.githubusercontent.com/tpemartin/110-1-r4ds-main/main/support/final_project.R", output="final_project.R")
source("final_project.R")
# first we correct feature names
names(drug$data) <- unlist(drug$data[1,])
# remove the first feature name row
drug$data <- drug$data[-1,]
7.2.1 Summarise
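- Sample size (樣本數)
.df <- drug$data
.df |>
  dplyr::summarise(
    樣本數=dplyr::n()
  )
dplyr::summarise computes summary statistics of a data frame. In each input argument, the right-hand side of = says how to compute the value, and the left-hand side of = says what to call the result. dplyr::n() computes the sample size of the first input argument, which is .df here.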
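The name = computation pattern extends to several summaries in one call. The sketch below uses a made-up data frame `toy` with a hypothetical column weight_g; none of it comes from the drug dataset.

```r
# Toy data, invented for illustration (one missing value on purpose)
toy <- data.frame(weight_g = c(0, 12.5, NA, 30))

toy |>
  dplyr::summarise(
    n_rows    = dplyr::n(),                   # number of rows in the input
    n_missing = sum(is.na(weight_g)),         # how many weights are missing
    mean_g    = mean(weight_g, na.rm = TRUE)  # mean of the non-missing weights
  ) -> toy_summary

toy_summary  # one row, one column per named summary
```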
7.2.2 Create a new feature: mutate
Time span covered by the data:
Create a Date-class feature column named 發生日期2, a parsed version of 發生日期 (the date of occurrence).
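The conversion in this subsection adds 19110000 to the raw date. That step suggests 發生日期 stores Republic of China (Minguo) calendar dates as yyymmdd numbers; this is an inference from the code, not something the source states. Since ROC year + 1911 gives the Gregorian year, adding 1911 × 10000 shifts only the year digits. A small sketch with made-up values (the names roc_dates, roc_int, greg_int and greg_dates are hypothetical):

```r
# Hypothetical ROC-calendar dates in yyymmdd form
roc_dates <- c("1070315", "0901231")

roc_dates |> as.integer() -> roc_int    # text to integer
roc_int + 19110000 -> greg_int          # shift the year by 1911
lubridate::ymd(greg_int) -> greg_dates  # parse yyyymmdd numbers as Date

greg_dates
```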
# traditional approach
.df$發生日期 |> as.integer() -> .dates
.dates + 19110000 -> .dates2
lubridate::ymd(.dates2) -> .df$發生日期2
# dplyr
.df |>
  dplyr::mutate(
    發生日期2={
      .df$發生日期 |> as.integer() -> .dates
      .dates + 19110000 -> .dates2
      lubridate::ymd(.dates2)
    }
  ) |> View()
- When a dplyr function needs a column of its 1st input, you can omit the 1st_input_dataframe$ prefix.
# dplyr
.df |>
  dplyr::mutate(
    發生日期2={
      發生日期 |> as.integer() -> .dates # .df$ omitted
      .dates + 19110000 -> .dates2
      lubridate::ymd(.dates2)
    }
  ) -> .df # remember to save the result back if you want to keep it
- Use .df to compute the range of 發生日期2:
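.df |>
  dplyr::summarise(
    資料期間=range(發生日期2, na.rm=T) # 資料期間: period covered by the data
  )
.df |>
  dplyr::mutate(
    發生西元年=lubridate::year(發生日期2) # 發生西元年: Gregorian year of occurrence
  ) |>
  dplyr::summarise(
    有哪些年={發生西元年 |> unique()} # 有哪些年: which years appear
  )
- Drug seizure data are not available for every year, only for the following years: 2001, 2003, 2007, 2011, 2012, 2016, 2017, 2018, 2019.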
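range() and unique() can return more than one value, so the summaries above may produce several rows, and recent dplyr releases (1.1.0 and later) warn when a summary returns more than one row per group. One alternative, sketched here on made-up dates, is to use min() and max() as separate columns and dplyr::distinct() for the list of years. The data frame `toy` and all column names are hypothetical.

```r
# Toy dates, invented for illustration (one missing value on purpose)
toy <- data.frame(
  event_date = as.Date(c("2001-05-02", "2018-03-15", NA, "2018-11-30"))
)

# One row with two named columns instead of a two-row range()
toy |>
  dplyr::summarise(
    first_date = min(event_date, na.rm = TRUE),
    last_date  = max(event_date, na.rm = TRUE)
  ) -> toy_span

# One row per distinct year instead of unique() inside summarise()
toy |>
  dplyr::mutate(event_year = lubridate::year(event_date)) |>
  dplyr::filter(!is.na(event_year)) |>
  dplyr::distinct(event_year) -> toy_years
```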
Next, we want to ask the following questions:
Is the drug problem getting worse?
Which drug item is the biggest source of the problem?
7.2.3 For each group
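Total weight of seized drugs (破獲毒品總重量), by year and by drug item
- First, see how to do the computation in dplyr without grouping:
.df |>
  dplyr::summarise(
    破獲毒品總重量={
      `數量(淨重)_克` |>
        as.numeric() |>
        sum(na.rm=T)
    }
  )
- Then add group_by in the step before it: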
.df |>
  dplyr::mutate(
    發生西元年=lubridate::year(發生日期2)
  ) -> .df
.df |>
  dplyr::group_by(發生西元年, 毒品品項) |>
  dplyr::summarise(
    破獲毒品總重量={
      `數量(淨重)_克` |>
        as.numeric() |>
        sum(na.rm=T)
    }
  ) |>
  dplyr::ungroup() -> .summary
.summary |> View()
- Every step after group_by is computed for each group until you ungroup it. Remember to ungroup when your groupwise computation is done.
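When the last grouped step is summarise, its `.groups = "drop"` argument (available in dplyr 1.0.0 and later) is another way to get an ungrouped result. The sketch below uses a made-up data frame `toy` with hypothetical columns year, item and weight_g.

```r
# Toy data, invented for illustration
toy <- data.frame(
  year     = c(2017, 2017, 2018, 2018, 2018),
  item     = c("A", "B", "A", "A", "B"),
  weight_g = c(1, 2, 3, NA, 5)
)

toy |>
  dplyr::group_by(year, item) |>
  dplyr::summarise(
    total_g = sum(weight_g, na.rm = TRUE),  # computed once per (year, item) group
    .groups = "drop"                        # return an ungrouped data frame
  ) -> toy_totals

dplyr::is_grouped_df(toy_totals)  # FALSE: no grouping is left over
```

An explicit dplyr::ungroup() is still needed when the grouped steps end with something other than summarise, such as a grouped mutate.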
7.2.4 Sort by column
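The code in Section 7.2.5 uses an object named .summary2 that is not defined in the text as it stands. One plausible construction, offered as an assumption and not as the author's code, is the total weight per item sorted in descending order with dplyr::arrange. The sketch uses a made-up data frame `toy`; the columns item and weight_g are hypothetical stand-ins for 毒品品項 and `數量(淨重)_克`.

```r
# Toy data, invented for illustration; weights are text here because the
# chapter's own code applies as.numeric() to the weight column
toy <- data.frame(
  item     = c("A", "B", "C", "A", "C"),
  weight_g = c("5", "100", "20", "10", NA)
)

toy |>
  dplyr::group_by(item) |>
  dplyr::summarise(
    total_g = weight_g |> as.numeric() |> sum(na.rm = TRUE)
  ) |>
  dplyr::ungroup() |>
  dplyr::arrange(dplyr::desc(total_g)) -> toy_sorted  # largest total first

toy_sorted$item[1:2]  # the two items with the largest totals
```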
7.2.5 Subsample
- We analyze the time trend for the 20 drug items (毒品品項) with the largest total weight:
top20s <- .summary2$毒品品項[1:20]
.df |>
  dplyr::filter(
    毒品品項 %in% top20s
  ) -> .dftop20s
filter is a common function name in many packages. If you get an error message regarding filter and you did not put dplyr:: in front of it, try dplyr::filter to avoid the function name conflict.
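The name conflict can be seen with base R alone. When dplyr is not attached, a bare filter() resolves to stats::filter, a time-series function that knows nothing about data frame columns. The sketch below calls stats::filter explicitly so the behaviour does not depend on which packages are attached; the data frame `toy` and its columns are made up.

```r
# Toy data, invented for illustration
toy <- data.frame(year = c(2017, 2018, 2018), weight_g = c(50, 150, 300))

# stats::filter cannot find the column weight_g, so the call fails;
# tryCatch keeps the error message instead of stopping the script
clash <- tryCatch(
  stats::filter(toy, weight_g > 100),
  error = function(e) conditionMessage(e)
)

# The namespaced call always reaches the dplyr version
toy |> dplyr::filter(weight_g > 100) -> kept

clash  # the error message produced by stats::filter
kept   # the rows with weight_g above 100
```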
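.dftop20s |>
  dplyr::group_by(毒品品項, 發生西元年) |>
  dplyr::summarise(
    破獲毒品總重量={
      `數量(淨重)_克` |>
        as.numeric() |>
        sum(na.rm=T)
    }
  ) |>
  dplyr::ungroup() -> .summary3
.summary3 |> View()
All items show a marked increase in 2018. (Why?)
Weight is the yardstick here, but before drawing conclusions we must take into account that many cases in the raw data have a drug weight of zero. Could it be that just one or two large cases pushed the 2018 weight well above previous years, while the number of cases shows a less serious picture?
The analysis here could begin with:
library(dplyr)
After that, every dplyr:: prefix can be removed (keeping them is fine too).
For dplyr::filter, we recommend keeping dplyr:: to avoid a name clash.
Exercise 7.1 Use dplyr to rewrite the part on the number of drug seizure cases.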