第 6 章 Data Transformation

In this chapter, we are going to learn pipe operator and some data processing functions from dplyr package.

6.1 主要參考資料

6.2 Pipe operator

程式設計有很多部份像是個生產線加工流程，pipe operator讓流程更透明易懂。

Consider the following task:

converting the following Taiwan calendar dates into western calendar dates:

民國年月<-c("099/01","099/02","099/03")

We normally layout our strategy in sequential statement such as:

Append “1” to the front of 民國年月, and “/01” to the back of 民國年月。
Given updated 民國年月, use lubridate::ymd() to convert it to a Date(POSIXct) object.
Add 911 years to the step 2 object (using lubridate::years()).

library(stringr)
library(lubridate)
民國年月<-c("099/01","099/02","099/03")

# 1
str_c("1",民國年月,"/01")->民國年月

# 2
lubridate::ymd(民國年月) -> 民國年月

# 3
民國年月 + years(911)

民國年月<-c("099/01","099/02","099/03")

library(dplyr)
# 1
民國年月 %>% 
  str_c("1", . ,"/01") %>%
  ymd %>%
  { . + years(911) }

%>% pipe operator:

預設：%>% 把它左邊（LHS）的物件流入它右邊（RHS）函數的argument第一個位置。 )
若要流入右邊函數的「非」第一位置argument，則用.在右邊函數中標示。
%>% 可進行串接，則下一個%>%會以上個函數輸出結果當做新的LHS物件。
若右側是一串運算，而非函數，則用{...}把右側運算括住，並把LHS物件要出現的位置用.取代。

6.3 Data cleaning

In data science, it is very common to

subsample (取子樣本，即原樣本的一部份觀察值) based on some criteria: filter()
select certain variables (選取某些變數): select()
create a new variables (創新變數): mutate()
compute a statistic value based on observations (計算某些樣本特徵值，如一群觀察值的平均值、最大值等): summarise()
analysis by groups: group_by()

These can easily be achieved through the dplyr package.

library(dplyr)

When we first import a dataset, we usually have to check variable classes, NAs and other anomalies before we continue to exploratory data analysis. Tools provided by dplyr are very handy for this task.

範例資料

政府開放平台:106學年大專院校校別學生數

library(readr)
collegeData <- read_csv("https://quality.data.gov.tw/dq_download_csv.php?nid=6231&md5_url=dda8fca407131483c1d8cbf3d61cf793")

6.3.1 mutate

library(dplyr)
mutate(
  collegeData,
  學校代碼=as.factor(學校代碼),
  學校名稱=as.factor(學校名稱)
  ) -> collegeData2

上述的程式如果用以前的寫法，要如何寫呢？
將日間/進修別，等級別改成factor，同時一年級男生，一年級女生改成numeric。
請使用%>%方式寫。

6.3.2 Parsing characters

要將character class轉成不同class，雖然可以用as.XXX(), XXX為class name, 但它有時會判斷失敗，而把該筆資料轉成NA。

collegeData %>% 
  mutate(
    `日間∕進修別`=as.factor(`日間∕進修別`),
    等級別=as.factor(等級別),
    一年級男生=as.numeric(一年級男生),
    一年級女生=as.numeric(一年級女生)
  ) -> collegeData3

請檢查前面一年級男/女生的結果。

as.numeric("1,234")

Package readr comes with a variety of character (vector) parsers. They are names as parse_*() where * is the value type to parse to.

Package readr parsers usages:

parse_logical(x, na = c("", "NA"), locale = default_locale())

parse_integer(x, na = c("", "NA"), locale = default_locale())

parse_double(x, na = c("", "NA"), locale = default_locale())

parse_character(x, na = c("", "NA"), locale = default_locale())

where

na=... defines characters that should be parsed as NA.
locale =... is used to setup an environment that describes certain patterns of writing used in some part of the world, where … can be supplied by locale() function. (Note：locale means 場景)

Reference: RDS, Ch11.3 Parsing a vector

We can use locale=locale(xxx=...) to set xxx pattern of writing to different styles. For example, xxx can be

grouping_mark (千分位符號)
decimal_mark (小數點符號)

library(readr)
cl<-locale(grouping_mark = ",") # 創造cl來描述千分位符號使用","的習慣場景
collegeData %>% 
  mutate(
    `日間∕進修別`=as.factor(`日間∕進修別`),
    等級別=as.factor(等級別),
    一年級男生=parse_number(一年級男生,locale=cl),
    一年級女生=parse_number(一年級女生,locale=cl)
  ) -> collegeData4

When there are parsing problem warnings, it is good to use problems(.the_parsed_result_object) such as

problems(collegeData4$一年級男生)

選出collegeData4中一年級男生有parsing問題的樣本之學校名稱，日間/進修別，等級別。
collegeData4裡存不存在觀測值是「一年級男生，或一年級女生只有一個有parsing問題」。(hint: setequal(.x,.y)可比對兩向量的內容集合是否相同,setdiff(.x,.y)會留下.x集合中不在.y集合的元素)

關於set***(.x,.y)一系列的說明可見：

RDS, Ch13.7

New variables generated within mutate() can be sequentially called–variables defined at the front can be called immediately at the back. The following two examples are equivalent.

collegeData %>% 
  mutate(
    `日間∕進修別`=as.factor(`日間∕進修別`),
    等級別=as.factor(等級別),
    一年級男生=parse_number(一年級男生,locale=cl),
    一年級女生=parse_number(一年級女生,locale=cl)
  ) -> collegeData4

collegeData4 %>%
  mutate(
    一年級男生每千人=一年級男生/1000,
    一年級女生每千人=一年級女生/1000
    ) -> collegeData5

collegeData %>% 
  mutate(
    `日間∕進修別`=as.factor(`日間∕進修別`),
    等級別=as.factor(等級別),
    一年級男生=parse_number(一年級男生,locale=cl),
    一年級女生=parse_number(一年級女生,locale=cl),
    一年級男生每千人=一年級男生/1000,
    一年級女生每千人=一年級女生/1000
    ) -> collegeData6

head(collegeData5[,c("一年級男生","一年級女生","一年級男生每千人","一年級女生每千人")])

head(collegeData6[,c("一年級男生","一年級女生","一年級男生每千人","一年級女生每千人")])

6.3.3 Apply to multiple variables

If you want to mutate a group of variables you can use

mutate_all(.data, .funs): for all variables 對所有變數做.funs的動作。.funs可使用funs()函數，使用範例如下：

範例一個函數

library(stringr)
college1stYearData %>%
  mutate_all(funs(parse_number(.,locale=cl))) ->
  college1stYearDataNumeric

範例二個函數

college1stYearDataNumeric %>%
  mutate_all(
    funs(
      每百人=./100,
      每千人=./1000
    )
  )

執行以下程式：

library(googlesheets)
gs_auth(new_user = TRUE)
gsSurvey<-gs_key("1mC9bnxj11NCNoOCw0Vmn4nxERbHtLjeGo9v9C9b2GDE")  #download sheet information as an object
classSurvey<-gs_read(gsSurvey,ws=1) #access the 1st worksheet
classSurvey %>% select(居住地行政區,性別) ->
  classSurveyPartial

將classSurveyPartial的所有變數轉成factor class。

mutate_at(.data, .vars, .funs): for some variables 對.vars所定義選出的變數做.funs的動作, 其中.vars可透過vars()來定義。

範例直接使用變數名稱：

collegeData %>% mutate_at(
  vars(一年級男生,一年級女生),
  funs(parse_number(.,locale=cl))
) -> college1stYearDataA

college1stYearDataA %>%
  select(一年級男生,一年級女生)

vars() can work with the following select helper functions:

使用文字特徵：
starts_with()：以什麼文字開頭
ends_with()：以什麼文字結尾
contains()：包含什麼文字
matches()：其他複雜可透過regular expression表示的文字規則
使用數字特徵：想選出變數x001,x002,…,x083
num_range()：num_range("x",1:83)

範例利用變數名稱規則

collegeData %>% mutate_at(
  vars(starts_with("一年級")),
  funs(parse_number(.,locale=cl))
) -> college1stYearDataB

college1stYearDataB %>% 
  select(一年級男生,一年級女生)

將原始collegeData資料裡的所有男/女生變數都轉成numeric class。