Chapter 7 dplyr
7.1 Package features
Almost all functions in dplyr take a data frame as their FIRST input and return a value that is also a data frame.
- "As its FIRST input" means that:
dplyr::xxx_function(dataSet1, ...) -> result1 # can be expressed as
dataSet1 |> dplyr::xxx_function(...) -> result1
- "The returned value is a data frame" means that we can feed the result into another dplyr function without a problem.
It is very common to see users feed dataSet1 to dplyr::xxx_function and then feed the result to dplyr::yyy_function. Traditionally, this is expressed as:
dplyr::xxx_function(dataSet1, ...) -> result1
dplyr::yyy_function(result1, ...) -> final_result
which is the same as:
dplyr::yyy_function(dplyr::xxx_function(dataSet1, ...), ...) -> final_result
Applying the first-input property, the same computation becomes:
dplyr::xxx_function(dataSet1, ...) |>
  dplyr::yyy_function(...) -> final_result
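The three forms above can be checked against each other on a concrete case. The following is a minimal sketch, assuming R 4.1 or later (for the native pipe `|>`) and using the built-in `mtcars` data rather than the chapter's data set; `dplyr::filter` and `dplyr::summarise` merely stand in for `xxx_function` and `yyy_function`.

```r
# Toy illustration with the built-in mtcars data (not the chapter's data).

# 1. Traditional: store the intermediate result
dplyr::filter(mtcars, cyl == 4) -> result1
dplyr::summarise(result1, mean_mpg = mean(mpg)) -> final_traditional

# 2. Nested call: the inner result is the first input of the outer function
dplyr::summarise(
  dplyr::filter(mtcars, cyl == 4),
  mean_mpg = mean(mpg)
) -> final_nested

# 3. Pipe: the left-hand side becomes the first input of the next function
mtcars |>
  dplyr::filter(cyl == 4) |>
  dplyr::summarise(mean_mpg = mean(mpg)) -> final_piped

# All three forms should give the same one-row data frame
identical(final_traditional, final_nested)
identical(final_nested, final_piped)
```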
A function does something to its input. So the following phrase:
- use dataSet1 to create a new feature, then use the feature to compute its mean
is often expressed as:
dataSet1 |>
  dplyr function to create a new feature |>
  dplyr function to compute its mean.
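As one possible concrete reading of this pattern, the sketch below uses the built-in `mtcars` data (an assumption for illustration, not the chapter's data). The new feature `wt_kg` and the result name `mean_wt_kg` are hypothetical names.

```r
# mtcars$wt is the car weight in units of 1000 lbs
mtcars |>
  dplyr::mutate(wt_kg = wt * 1000 * 0.4536) |>       # create a new feature
  dplyr::summarise(mean_wt_kg = mean(wt_kg)) -> result  # compute its mean

result
```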
7.2 An example
drug <- list()
drug$source[[1]] <-
  "https://docs.google.com/spreadsheets/d/17ID43N3zeXqCvbUrc_MbpgE6dH7BjLm8BHv8DUcpZZ4/edit?usp=sharing"
drug$data <-
  googlesheets4::read_sheet(
    drug$source[[1]]
  )
xfun::download_file("https://raw.githubusercontent.com/tpemartin/110-1-r4ds-main/main/support/final_project.R", output="final_project.R")
source("final_project.R")
# first we correct feature names
names(drug$data) <- unlist(drug$data[1,])
# remove the first feature name row
drug$data <- drug$data[-1,]
7.2.1 Summarise
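- Sample size
.df <- drug$data
.df |>
  dplyr::summarise(
    樣本數=dplyr::n()
  )
dplyr::summarise computes descriptive statistics of the data. In each input argument, the right-hand side of = says how to compute the value, and the left-hand side of = is the name given to the result. dplyr::n() computes the sample size of the first input argument, which is .df here.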
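To see the name = computation pattern on something small, here is a minimal sketch with a made-up data frame `toy`; its columns and the result names `n_cases` and `total_grams` are hypothetical and are not taken from the drug data.

```r
# A made-up data frame, for illustration only
toy <- data.frame(
  item  = c("A", "B", "A", "C"),
  grams = c(10, 0, 5.5, 2)
)

toy |>
  dplyr::summarise(
    n_cases     = dplyr::n(),   # left: result name, right: how to compute it
    total_grams = sum(grams)
  ) -> toy_summary

toy_summary
```

Several name = computation pairs can be given in one call; each becomes a column of the returned data frame.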
7.2.2 Create a new feature: mutate
Data coverage period:
Create a Date-class feature column named 發生日期2.
# traditional approach
.df$發生日期 |> as.integer() -> .dates
.dates + 19110000 -> .dates2
lubridate::ymd(.dates2) -> .df$發生日期2
# dplyr
.df |>
  dplyr::mutate(
    發生日期2={
      .df$發生日期 |> as.integer() -> .dates
      .dates + 19110000 -> .dates2
      lubridate::ymd(.dates2)
    }
  ) |> View()
- When a dplyr function needs a column of its first input, you can omit the 1st_input_dataframe$ prefix.
# dplyr
.df |>
  dplyr::mutate(
    發生日期2={
      發生日期 |> as.integer() -> .dates # .df$ omitted
      .dates + 19110000 -> .dates2
      lubridate::ymd(.dates2)
    }
  ) -> .df # remember to save the result back if you want to keep it
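The `+ 19110000` step suggests that the raw dates are stored as Republic of China calendar values in yyyMMdd form, so that adding 1911 to the year part gives the Gregorian year. That reading is an assumption, as are the made-up data frame `toy` and its column names in the sketch below.

```r
# Made-up ROC-calendar dates in yyyMMdd form (assumed format)
toy <- data.frame(roc_date = c("1070315", "0901231", "1080101"))

toy |>
  dplyr::mutate(
    date2 = {
      roc_date |> as.integer() -> .dates   # "0901231" becomes 901231
      .dates + 19110000 -> .dates2         # shift the year part by 1911
      lubridate::ymd(.dates2)              # parse yyyymmdd numbers as Date
    }
  ) -> toy

toy$date2
class(toy$date2)
```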
- Use .df to compute the range of 發生日期2
.df |>
  dplyr::summarise(
    資料期間=range(發生日期2, na.rm=T)
  )
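Note that range() returns two values. Recent dplyr versions (1.1.0 and later) warn when summarise() yields more than one row per group and suggest dplyr::reframe() instead. One alternative, sketched here on a made-up data frame with hypothetical names, is to compute the minimum and maximum as two separate columns so that the result stays a single row.

```r
# Made-up dates, including a missing value
toy <- data.frame(d = as.Date(c("2018-03-15", NA, "2001-12-31")))

toy |>
  dplyr::summarise(
    first_date = min(d, na.rm = TRUE),
    last_date  = max(d, na.rm = TRUE)
  ) -> period

period
```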
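.df |>
  dplyr::mutate(
    發生西元年=lubridate::year(發生日期2)
  ) |>
  dplyr::summarise(
    有哪些年={發生西元年 |> unique()}
  )
- Not every year has drug-seizure data; only the following years do: 2001 2003 2007 2011 2012 2016 2017 2018 2019.
Next, we want to ask the following questions:
Is the drug problem getting more and more serious?
Which drug item is the largest source of the problem?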
7.2.3 For each group
Total weight of seized drugs (破獲毒品總重量), by year and drug item
- First, see how to do the computation in dplyr without grouping:
.df |>
  dplyr::summarise(
    破獲毒品總重量={
      `數量(淨重)_克` |>
        as.numeric() |>
        sum(na.rm=T)
    }
  )
- Then add group_by in the step before it:
.df |>
  dplyr::mutate(
    發生西元年=lubridate::year(發生日期2)
  ) -> .df

.df |>
  dplyr::group_by(發生西元年, 毒品品項) |>
  dplyr::summarise(
    破獲毒品總重量={
      `數量(淨重)_克` |>
        as.numeric() |>
        sum(na.rm=T)
    }
  ) |>
  dplyr::ungroup() -> .summary

.summary |> View()
- Every step after group_by is computed for each group until you ungroup the data. Remember to ungroup when your groupwise computation is done.
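The effect of grouping and ungrouping can be seen on a small made-up data frame. In the sketch below, `toy` and its column names are hypothetical; the weights are stored as text to mimic a column that needs as.numeric(). The second pipeline uses the `.groups = "drop"` argument of dplyr::summarise (available in dplyr 1.0.0 and later), which is one alternative to a separate ungroup step.

```r
toy <- data.frame(
  year  = c(2017, 2017, 2018, 2018, 2018),
  item  = c("A", "B", "A", "A", "B"),
  grams = c("10", "5", "20", NA, "1")
)

# group_by + summarise + ungroup, as in the text
toy |>
  dplyr::group_by(year, item) |>
  dplyr::summarise(
    total_grams = grams |> as.numeric() |> sum(na.rm = TRUE)
  ) |>
  dplyr::ungroup() -> by_group

# Alternative: drop the grouping inside summarise
toy |>
  dplyr::group_by(year, item) |>
  dplyr::summarise(
    total_grams = grams |> as.numeric() |> sum(na.rm = TRUE),
    .groups = "drop"
  ) -> by_group2

by_group
dplyr::is_grouped_df(by_group)   # FALSE once ungrouped
```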
7.2.4 Sort by column
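The source gives no text for this section. As a minimal sketch of what sorting by a column usually looks like in dplyr, the example below uses dplyr::arrange together with dplyr::desc on a made-up summary table. The names `toy_summary`, `toy_sorted`, `item`, and `total_grams` are hypothetical, and this is not necessarily how the .summary2 object used in the next section was built.

```r
# A made-up summary table, for illustration only
toy_summary <- data.frame(
  item        = c("A", "B", "C"),
  total_grams = c(30, 6, 120)
)

toy_summary |>
  dplyr::arrange(dplyr::desc(total_grams)) -> toy_sorted  # largest first

toy_sorted
toy_sorted$item[1:2]   # the two items with the largest totals
```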
7.2.5 Subsample
- We analyze the time trend for the 20 drug items with the largest total weight:
top20s <- .summary2$毒品品項[1:20]

.df |>
  dplyr::filter(
    毒品品項 %in% top20s
  ) -> .dftop20s
filter is a common function name in many packages. If you receive an error message regarding filter and you did not put dplyr:: in front of it, try dplyr::filter to avoid a function name conflict.
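One well-known source of such a conflict is base R itself: the stats package, attached by default, has its own filter function (a linear filter for time series), and that is what a bare filter refers to when dplyr is not attached. The sketch below uses the explicit dplyr:: prefix on a made-up data frame with hypothetical names to keep rows whose item is in a given set.

```r
toy <- data.frame(
  item  = c("A", "B", "C", "A"),
  grams = c(1, 2, 3, 4)
)
keep <- c("A", "C")

# Explicit prefix: always dplyr's row filter, whatever else is attached
toy |>
  dplyr::filter(item %in% keep) -> toy_kept

toy_kept
```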
.dftop20s |>
  dplyr::group_by(毒品品項, 發生西元年) |>
  dplyr::summarise(
    破獲毒品總重量={
      `數量(淨重)_克` |>
        as.numeric() |>
        sum(na.rm=T)
    }
  ) |>
  dplyr::ungroup() -> .summary3

.summary3 |> View()
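All items show a clear increase in 2018 (Why?)
Here the yardstick is drug weight. Before drawing a conclusion, however, we must consider that many cases in the raw data have a drug weight of zero. Could it be that just one or two large cases pushed the 2018 weight well above previous years, while the situation looks less severe in terms of the number of cases?
For this analysis you can run the following at the start:
library(dplyr)
After that, every dplyr:: prefix can be deleted (or kept, of course).
For dplyr::filter, keeping the dplyr:: prefix is recommended to avoid a name clash.
Exercise 7.1 Rewrite the part on the number of drug-seizure cases using dplyr.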