第 6 章 Data Transformation

In this chapter, we are going to learn pipe operator and some data processing functions from dplyr package.

6.2 Pipe operator

程式設計有很多部份像是個生產線加工流程,pipe operator讓流程更透明易懂。

Consider the following task:

converting the following Taiwan calendar dates into western calendar dates:

民國年月<-c("099/01","099/02","099/03")

We normally layout our strategy in sequential statement such as:

  1. Append “1” to the front of 民國年月, and “/01” to the back of 民國年月。

  2. Given updated 民國年月, use lubridate::ymd() to convert it to a Date(POSIXct) object.

  3. Add 911 years to the step 2 object (using lubridate::years()).

library(stringr)
library(lubridate)
民國年月<-c("099/01","099/02","099/03")

# 1
str_c("1",民國年月,"/01")->民國年月

# 2
lubridate::ymd(民國年月) -> 民國年月

# 3
民國年月 + years(911)
民國年月<-c("099/01","099/02","099/03")

library(dplyr)
# 1
民國年月 %>% 
  str_c("1", . ,"/01") %>%
  ymd %>%
  { . + years(911) }

%>% pipe operator:

  1. 預設:%>% 把它左邊(LHS)的物件流入它右邊(RHS)函數的argument第一個位置。 )

  2. 若要流入右邊函數的「非」第一位置argument,則用.在右邊函數中標示。

  3. %>% 可進行串接,則下一個%>%會以上個函數輸出結果當做新的LHS物件。

  4. 若右側是一串運算,而非函數,則用{...}把右側運算括住,並把LHS物件要出現的位置用.取代。

6.3 Data cleaning

In data science, it is very common to

  • subsample (取子樣本,即原樣本的一部份觀察值) based on some criteria: filter()

  • select certain variables (選取某些變數): select()

  • create a new variables (創新變數): mutate()

  • compute a statistic value based on observations (計算某些樣本特徵值,如一群觀察值的平均值、最大值等): summarise()

  • analysis by groups: group_by()

These can easily be achieved through the dplyr package.

library(dplyr)

When we first import a dataset, we usually have to check variable classes, NAs and other anomalies before we continue to exploratory data analysis. Tools provided by dplyr are very handy for this task.

範例資料

政府開放平台:106學年大專院校校別學生數

library(readr)
collegeData <- read_csv("https://quality.data.gov.tw/dq_download_csv.php?nid=6231&md5_url=dda8fca407131483c1d8cbf3d61cf793")

6.3.1 mutate

library(dplyr)
mutate(
  collegeData,
  學校代碼=as.factor(學校代碼),
  學校名稱=as.factor(學校名稱)
  ) -> collegeData2
  1. 上述的程式如果用以前的寫法,要如何寫呢?

  2. 日間/進修別,等級別改成factor,同時一年級男生,一年級女生改成numeric。

  3. 請使用%>%方式寫。

6.3.2 Parsing characters

要將character class轉成不同class,雖然可以用as.XXX(), XXX為class name, 但它有時會判斷失敗,而把該筆資料轉成NA。

collegeData %>% 
  mutate(
    `日間∕進修別`=as.factor(`日間∕進修別`),
    等級別=as.factor(等級別),
    一年級男生=as.numeric(一年級男生),
    一年級女生=as.numeric(一年級女生)
  ) -> collegeData3

請檢查前面一年級男/女生的結果。

as.numeric("1,234")

Package readr comes with a variety of character (vector) parsers. They are names as parse_*() where * is the value type to parse to.

Package readr parsers usages:

parse_logical(x, na = c("", "NA"), locale = default_locale())

parse_integer(x, na = c("", "NA"), locale = default_locale())

parse_double(x, na = c("", "NA"), locale = default_locale())

parse_character(x, na = c("", "NA"), locale = default_locale())

where

  • na=... defines characters that should be parsed as NA.

  • locale =... is used to setup an environment that describes certain patterns of writing used in some part of the world, where … can be supplied by locale() function. (Note:locale means 場景)

Reference: RDS, Ch11.3 Parsing a vector

We can use locale=locale(xxx=...) to set xxx pattern of writing to different styles. For example, xxx can be

  • grouping_mark (千分位符號)

  • decimal_mark (小數點符號)

library(readr)
cl<-locale(grouping_mark = ",") # 創造cl來描述千分位符號使用","的習慣場景
collegeData %>% 
  mutate(
    `日間∕進修別`=as.factor(`日間∕進修別`),
    等級別=as.factor(等級別),
    一年級男生=parse_number(一年級男生,locale=cl),
    一年級女生=parse_number(一年級女生,locale=cl)
  ) -> collegeData4

When there are parsing problem warnings, it is good to use problems(.the_parsed_result_object) such as

problems(collegeData4$一年級男生)
  1. 選出collegeData4中一年級男生有parsing問題的樣本之學校名稱,日間/進修別,等級別

  2. collegeData4裡存不存在觀測值是「一年級男生,或一年級女生只有一個有parsing問題」。(hint: setequal(.x,.y)可比對兩向量的內容集合是否相同,setdiff(.x,.y)會留下.x集合中不在.y集合的元素)

關於set***(.x,.y)一系列的說明可見:

New variables generated within mutate() can be sequentially called–variables defined at the front can be called immediately at the back. The following two examples are equivalent.

collegeData %>% 
  mutate(
    `日間∕進修別`=as.factor(`日間∕進修別`),
    等級別=as.factor(等級別),
    一年級男生=parse_number(一年級男生,locale=cl),
    一年級女生=parse_number(一年級女生,locale=cl)
  ) -> collegeData4

collegeData4 %>%
  mutate(
    一年級男生每千人=一年級男生/1000,
    一年級女生每千人=一年級女生/1000
    ) -> collegeData5
collegeData %>% 
  mutate(
    `日間∕進修別`=as.factor(`日間∕進修別`),
    等級別=as.factor(等級別),
    一年級男生=parse_number(一年級男生,locale=cl),
    一年級女生=parse_number(一年級女生,locale=cl),
    一年級男生每千人=一年級男生/1000,
    一年級女生每千人=一年級女生/1000
    ) -> collegeData6
head(collegeData5[,c("一年級男生","一年級女生","一年級男生每千人","一年級女生每千人")])

head(collegeData6[,c("一年級男生","一年級女生","一年級男生每千人","一年級女生每千人")])

6.3.3 Apply to multiple variables

If you want to mutate a group of variables you can use


  • mutate_all(.data, .funs): for all variables 對所有變數做.funs的動作。.funs可使用funs()函數,使用範例如下:

範例 一個函數

library(stringr)
college1stYearData %>%
  mutate_all(funs(parse_number(.,locale=cl))) ->
  college1stYearDataNumeric

範例 二個函數

college1stYearDataNumeric %>%
  mutate_all(
    funs(
      每百人=./100,
      每千人=./1000
    )
  )

執行以下程式:

library(googlesheets)
gs_auth(new_user = TRUE)
gsSurvey<-gs_key("1mC9bnxj11NCNoOCw0Vmn4nxERbHtLjeGo9v9C9b2GDE")  #download sheet information as an object
classSurvey<-gs_read(gsSurvey,ws=1) #access the 1st worksheet
classSurvey %>% select(居住地行政區,性別) ->
  classSurveyPartial

將classSurveyPartial的所有變數轉成factor class。


  • mutate_at(.data, .vars, .funs): for some variables 對.vars所定義選出的變數做.funs的動作, 其中.vars可透過vars()來定義。

範例 直接使用變數名稱:

collegeData %>% mutate_at(
  vars(一年級男生,一年級女生),
  funs(parse_number(.,locale=cl))
) -> college1stYearDataA

college1stYearDataA %>%
  select(一年級男生,一年級女生)

vars() can work with the following select helper functions:

  • 使用文字特徵:
    starts_with():以什麼文字開頭
    ends_with():以什麼文字結尾
    contains():包含什麼文字
    matches():其他複雜可透過regular expression表示的文字規則

  • 使用數字特徵:想選出變數x001,x002,…,x083
    num_range():num_range("x",1:83)

範例 利用變數名稱規則

collegeData %>% mutate_at(
  vars(starts_with("一年級")),
  funs(parse_number(.,locale=cl))
) -> college1stYearDataB

college1stYearDataB %>% 
  select(一年級男生,一年級女生)

將原始collegeData資料裡的所有男/女生變數都轉成numeric class。