Chapter 7 dplyr

https://dplyr.tidyverse.org/
cheat sheets: https://www.rstudio.com/resources/cheatsheets/
- dplyr

7.1 Package features

Almost every function in dplyr takes a data frame as its FIRST input and returns a value that is also a data frame.
Taking the data frame as the FIRST input means that:

dplyr::xxx_function(dataSet1, ...) -> result1 # can be expressed as
dataSet1 |> dplyr::xxx_function(...) -> result1

Returning a data frame means that we can feed the result into another dplyr function without a problem.

It is very common to feed dataSet1 to dplyr::xxx_function and then feed its result to dplyr::yyy_function. Traditionally, this is expressed as:

dplyr::xxx_function(dataSet1, ...) -> result1
dplyr::yyy_function(result1, ...) -> final_result

which is the same as:

dplyr::yyy_function(dplyr::xxx_function(dataSet1, ...), ...) -> final_result

Applying the first-input property, this becomes:

dplyr::xxx_function(dataSet1, ...) |>
  dplyr::yyy_function(...) -> final_result

A function does something to its input. So the following phrase:

use dataSet1 to create a new feature then use the feature to compute its mean

is often expressed as:

dataSet1 |>
  dplyr function to create a new feature |>
  dplyr function to compute its mean.
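
As a concrete, minimal sketch of this pattern, assume the built-in mtcars data set stands in for dataSet1. The names wt_kg and mean_wt_kg are illustrative and are not used elsewhere in this chapter:

```r
# mtcars ships with R; its wt column is the car weight in units of 1000 lb
mtcars |>
  # step 1: create a new feature (weight converted to kilograms)
  dplyr::mutate(wt_kg = wt * 453.592) |>
  # step 2: use the new feature to compute its mean
  dplyr::summarise(mean_wt_kg = mean(wt_kg)) -> result

result
```

Each step receives the data frame produced by the previous step as its first input, so the code reads in the same order as the sentence.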

7.2 An example

drug <- list()
drug$source[[1]] <- 
  "https://docs.google.com/spreadsheets/d/17ID43N3zeXqCvbUrc_MbpgE6dH7BjLm8BHv8DUcpZZ4/edit?usp=sharing"
drug$data <- 
  googlesheets4::read_sheet(
    drug$source[[1]]
  )
xfun::download_file("https://raw.githubusercontent.com/tpemartin/110-1-r4ds-main/main/support/final_project.R", output="final_project.R")
source("final_project.R")

# first we correct feature names
names(drug$data) <- unlist(drug$data[1,])
# remove the first feature name row
drug$data <- drug$data[-1,]

7.2.1 Summarise

Sample size

.df <- drug$data
.df |>
  dplyr::summarise(
    樣本數=dplyr::n()
  )

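dplyr::summarise computes summary statistics of the data. In each input argument, the right-hand side of = says how to compute the value, and the left-hand side of = is the name given to the result.
dplyr::n() computes the sample size (the number of rows) of the first input argument, which is .df here.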
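
Several summary statistics can be computed in one call. The sketch below uses a small made-up data frame (toy and its weight column are hypothetical, not part of the drug data) to show the name = computation pattern:

```r
# toy data: four records, one of them with a missing weight
toy <- data.frame(weight = c(10, 0, 25, NA))

toy |>
  dplyr::summarise(
    n_rows = dplyr::n(),                      # sample size
    total_weight = sum(weight, na.rm = TRUE), # sum, ignoring missing values
    n_missing = sum(is.na(weight))            # how many weights are missing
  ) -> toy_summary

toy_summary
```

The result is a data frame with one row and one column per named argument.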

7.2.2 Create a new feature: mutate

Period covered by the data:
Create a feature column 發生日期2 (date of occurrence) of Date class

# traditional approach
.df$發生日期 |> as.integer() -> .dates
# adding 19110000 shifts the year part by 1911 (yyymmdd becomes yyyymmdd)
.dates + 19110000 -> .dates2
lubridate::ymd(.dates2) -> .df$發生日期2

# dplyr
.df |>
  dplyr::mutate(
    發生日期2={
      .df$發生日期 |> as.integer() -> .dates
      .dates + 19110000 -> .dates2
      lubridate::ymd(.dates2)
    }
  ) |> View()

When a dplyr function needs to use columns of its first input, you can omit 1st_input_dataframe$.

# dplyr
.df |>
  dplyr::mutate(
    發生日期2={
      發生日期 |> as.integer() -> .dates # .df$ is omitted
      .dates + 19110000 -> .dates2
      lubridate::ymd(.dates2)
    }
  ) -> .df # remember to save the result back if you want to keep it
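
The date conversion can be checked on a small example. The sketch below assumes that the raw dates are 7-character strings of the form yyymmdd whose year counts from 1911, which is what adding 19110000 implies. The data frame roc and its columns are made up for illustration:

```r
# hypothetical raw dates in yyymmdd form (year 107 and year 090)
roc <- data.frame(roc_date = c("1070315", "0901231"))

roc |>
  dplyr::mutate(
    # 1070315 + 19110000 = 20180315, which lubridate::ymd() parses as a Date
    date = lubridate::ymd(as.integer(roc_date) + 19110000)
  ) -> roc

roc$date
class(roc$date)
```

Inside mutate, roc_date refers to the column of the first input, so roc$ is not needed.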

Use .df to compute the range of 發生日期2

.df |>
  dplyr::summarise(
    資料期間=range(發生日期2, na.rm=T)
  )

.df |>
  dplyr::mutate(
    發生西元年=lubridate::year(發生日期2)
  ) |>
  dplyr::summarise(
    有哪些年={發生西元年 |> unique()}
  )

Not every year has drug seizure data; only the following years do: 2001 2003 2007 2011 2012 2016 2017 2018 2019.
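
The two summaries above each return more than one value (range gives two values, unique gives one per distinct year). In dplyr 1.1.0 and later, summarise warns in that situation, and dplyr::reframe is the function meant for results of any length. A minimal sketch on made-up dates (toy_dates and its date column are hypothetical):

```r
toy_dates <- data.frame(
  date = as.Date(c("2018-03-15", "2001-12-31", "2012-07-01"))
)

# reframe() allows each expression to return any number of rows
toy_dates |>
  dplyr::reframe(period = range(date, na.rm = TRUE)) -> period_long

# alternative: keep summarise() and return one value per column
toy_dates |>
  dplyr::summarise(
    first_day = min(date, na.rm = TRUE),
    last_day = max(date, na.rm = TRUE)
  ) -> period_wide

period_long
period_wide
```

Which form to use is a matter of taste; the second keeps the result to a single row.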

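Next, we want to raise the following questions:

Is the drug problem getting more and more serious?
Which drug item is the biggest source of the problem?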

7.2.3 For each group

Total weight of seized drugs

By year and item

First, see how to do the computation in dplyr without grouping:

.df |> 
  dplyr::summarise(
    破獲毒品總重量={
      `數量（淨重）_克` |>
        as.numeric() |> 
        sum(na.rm=T)
    }
  )

Next, add group_by in the step before it

.df |>
  dplyr::mutate(
    發生西元年=lubridate::year(發生日期2)
  ) -> .df

.df |> 
  dplyr::group_by(發生西元年, 毒品品項) |>
  dplyr::summarise(
    破獲毒品總重量={
      `數量（淨重）_克` |>
        as.numeric() |> 
        sum(na.rm=T)
    }
  ) |>
  dplyr::ungroup() -> .summary

.summary |> View()

Every step after group_by is computed for each group until you ungroup. Remember to ungroup when your groupwise computation is done.
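
summarise also accepts a .groups argument, so the grouping can be dropped in the same step instead of calling ungroup afterwards. A minimal sketch on made-up data (toy_cases and its columns are hypothetical):

```r
toy_cases <- data.frame(
  year  = c(2017, 2017, 2018, 2018, 2018),
  item  = c("A", "B", "A", "A", "B"),
  grams = c("10", "5", "20", NA, "7")  # stored as text, like the raw data
)

toy_cases |>
  dplyr::group_by(year, item) |>
  dplyr::summarise(
    total_grams = sum(as.numeric(grams), na.rm = TRUE),
    .groups = "drop"  # return an ungrouped data frame
  ) -> toy_totals

toy_totals
dplyr::is_grouped_df(toy_totals)
```

The result has one row per year and item combination and carries no grouping.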

7.2.4 Sort by column

By item

.df |> 
  dplyr::group_by(毒品品項) |>
  dplyr::summarise(
    破獲毒品總重量={
      `數量（淨重）_克` |>
        as.numeric() |> 
        sum(na.rm=T)
    }
  ) |>
  dplyr::ungroup() |>
  dplyr::arrange(
    dplyr::desc(破獲毒品總重量)
  ) -> .summary2

.summary2
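
To see the sorting step in isolation, here is a minimal sketch on a made-up summary table (toy_item_totals and its columns are hypothetical):

```r
toy_item_totals <- data.frame(
  item = c("A", "B", "C"),
  total_grams = c(30, 12, 45)
)

# arrange() sorts in ascending order by default; desc() reverses it
toy_item_totals |>
  dplyr::arrange(dplyr::desc(total_grams)) -> ranked

ranked$item
```

After sorting, the largest items are simply the first rows of the result.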

What are the top few items?

Time trend

Is the drug problem getting more and more serious?

7.2.5 Subsample

We analyze the time trend for the 20 drug items with the largest total weight:

top20s <- .summary2$毒品品項[1:20]

.df |> 
  dplyr::filter(
    毒品品項 %in% top20s
  ) -> .dftop20s

filter is a common function name in many packages. If you get an error message involving filter and you did not put dplyr:: in front of it, try dplyr::filter to avoid the function name conflict.
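
One source of the conflict is base R itself: the stats package, which is attached by default, has its own filter function for time series. A minimal sketch of the explicit call on made-up data (toy_items and keep_items are hypothetical):

```r
toy_items <- data.frame(
  item  = c("A", "B", "C", "A"),
  grams = c(10, 5, 7, 20)
)
keep_items <- c("A", "C")

# the dplyr:: prefix guarantees the data frame version is used,
# no matter which other packages are attached
toy_items |>
  dplyr::filter(item %in% keep_items) -> toy_kept

toy_kept
```

Rows whose item is not in keep_items are dropped; the remaining rows keep their original order.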

.dftop20s |> 
  dplyr::group_by(毒品品項, 發生西元年) |>
  dplyr::summarise(
    破獲毒品總重量={
      `數量（淨重）_克` |>
        as.numeric() |> 
        sum(na.rm=T)
    }
  ) |>
  dplyr::ungroup() -> .summary3

.summary3 |> View()

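Every item shows a marked increase in 2018 (Why?)
Here the weight of the drugs is the measure, but before drawing a conclusion we must consider that many cases in the raw data have a drug weight of zero. Could it be that just one or two large cases make the 2018 weight clearly exceed that of previous years, while in terms of the number of cases the situation is not that serious?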
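
The concern can be illustrated with made-up numbers. In the sketch below (toy_weights and all its values are hypothetical, not taken from the drug data), one year has far more total weight but fewer cases, because a single case dominates:

```r
toy_weights <- data.frame(
  year  = c(2017, 2017, 2017, 2018, 2018),
  grams = c(5, 0, 8, 0, 500)
)

toy_weights |>
  dplyr::group_by(year) |>
  dplyr::summarise(
    total_grams = sum(grams, na.rm = TRUE),
    n_cases = dplyr::n(),                          # number of cases
    n_zero_weight = sum(grams == 0, na.rm = TRUE), # cases with zero weight
    # share of the yearly total that comes from the single largest case
    largest_case_share = max(grams, na.rm = TRUE) / sum(grams, na.rm = TRUE),
    .groups = "drop"
  ) -> by_year

by_year
```

Putting weight and case counts side by side makes it easier to tell a broad increase from one driven by a few large cases.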

For this analysis, you can start with:

library(dplyr)

After that, every dplyr:: prefix can be removed (you can of course also keep them).

For dplyr::filter, it is recommended to keep dplyr:: to avoid a name clash.

Exercise 7.1 Use dplyr to rewrite the part on the number of drug seizure cases.