第 6 章 Data Transformation
In this chapter, we are going to learn pipe operator and some data processing functions from dplyr package.
6.2 Pipe operator
程式設計有很多部份像是個生產線加工流程,pipe operator讓流程更透明易懂。
Consider the following task:
converting the following Taiwan calendar dates into western calendar dates:
民國年月<-c("099/01","099/02","099/03")
We normally layout our strategy in sequential statement such as:
Append “1” to the front of 民國年月, and “/01” to the back of 民國年月。
Given updated 民國年月, use
lubridate::ymd()
to convert it to a Date(POSIXct) object.Add 911 years to the step 2 object (using
lubridate::years()
).
library(stringr)
library(lubridate)
民國年月<-c("099/01","099/02","099/03")
# 1
str_c("1",民國年月,"/01")->民國年月
# 2
lubridate::ymd(民國年月) -> 民國年月
# 3
民國年月 + years(911)
民國年月<-c("099/01","099/02","099/03")
library(dplyr)
# 1
民國年月 %>%
str_c("1", . ,"/01") %>%
ymd %>%
{ . + years(911) }
%>%
pipe operator:
預設:
%>%
把它左邊(LHS)的物件流入它右邊(RHS)函數的argument第一個位置。 )若要流入右邊函數的「非」第一位置argument,則用
.
在右邊函數中標示。%>%
可進行串接,則下一個%>%
會以上個函數輸出結果當做新的LHS物件。若右側是一串運算,而非函數,則用
{...}
把右側運算括住,並把LHS物件要出現的位置用.
取代。
6.3 Data cleaning
In data science, it is very common to
subsample (取子樣本,即原樣本的一部份觀察值) based on some criteria: filter()
select certain variables (選取某些變數): select()
create a new variables (創新變數): mutate()
compute a statistic value based on observations (計算某些樣本特徵值,如一群觀察值的平均值、最大值等): summarise()
analysis by groups: group_by()
These can easily be achieved through the dplyr package.
library(dplyr)
When we first import a dataset, we usually have to check variable classes, NAs and other anomalies before we continue to exploratory data analysis. Tools provided by dplyr are very handy for this task.
範例資料
政府開放平台:106學年大專院校校別學生數
library(readr)
collegeData <- read_csv("https://quality.data.gov.tw/dq_download_csv.php?nid=6231&md5_url=dda8fca407131483c1d8cbf3d61cf793")
6.3.1 mutate
library(dplyr)
mutate(
collegeData,
學校代碼=as.factor(學校代碼),
學校名稱=as.factor(學校名稱)
) -> collegeData2
上述的程式如果用以前的寫法,要如何寫呢?
將日間/進修別,等級別改成factor,同時一年級男生,一年級女生改成numeric。
- 請使用
%>%
方式寫。
6.3.2 Parsing characters
要將character class轉成不同class,雖然可以用as.XXX()
, XXX為class name, 但它有時會判斷失敗,而把該筆資料轉成NA。
collegeData %>%
mutate(
`日間∕進修別`=as.factor(`日間∕進修別`),
等級別=as.factor(等級別),
一年級男生=as.numeric(一年級男生),
一年級女生=as.numeric(一年級女生)
) -> collegeData3
請檢查前面一年級男/女生的結果。
as.numeric("1,234")
Package readr comes with a variety of character (vector) parsers. They are names as parse_*()
where * is the value type to parse to.
Package readr parsers usages:
parse_logical(x, na = c("", "NA"), locale = default_locale())
parse_integer(x, na = c("", "NA"), locale = default_locale())
parse_double(x, na = c("", "NA"), locale = default_locale())
parse_character(x, na = c("", "NA"), locale = default_locale())
where
na=...
defines characters that should be parsed as NA.locale =...
is used to setup an environment that describes certain patterns of writing used in some part of the world, where … can be supplied bylocale()
function. (Note:locale means 場景)
Reference: RDS, Ch11.3 Parsing a vector
We can use locale=locale(xxx=...)
to set xxx pattern of writing to different styles. For example, xxx can be
grouping_mark (千分位符號)
decimal_mark (小數點符號)
library(readr)
cl<-locale(grouping_mark = ",") # 創造cl來描述千分位符號使用","的習慣場景
collegeData %>%
mutate(
`日間∕進修別`=as.factor(`日間∕進修別`),
等級別=as.factor(等級別),
一年級男生=parse_number(一年級男生,locale=cl),
一年級女生=parse_number(一年級女生,locale=cl)
) -> collegeData4
When there are parsing problem warnings, it is good to use problems(.the_parsed_result_object)
such as
problems(collegeData4$一年級男生)
選出collegeData4中一年級男生有parsing問題的樣本之學校名稱,日間/進修別,等級別。
collegeData4裡存不存在觀測值是「一年級男生,或一年級女生只有一個有parsing問題」。(hint:
setequal(.x,.y)
可比對兩向量的內容集合是否相同,setdiff(.x,.y)
會留下.x集合中不在.y集合的元素)
關於set***(.x,.y)
一系列的說明可見:
New variables generated within mutate()
can be sequentially called–variables defined at the front can be called immediately at the back. The following two examples are equivalent.
collegeData %>%
mutate(
`日間∕進修別`=as.factor(`日間∕進修別`),
等級別=as.factor(等級別),
一年級男生=parse_number(一年級男生,locale=cl),
一年級女生=parse_number(一年級女生,locale=cl)
) -> collegeData4
collegeData4 %>%
mutate(
一年級男生每千人=一年級男生/1000,
一年級女生每千人=一年級女生/1000
) -> collegeData5
collegeData %>%
mutate(
`日間∕進修別`=as.factor(`日間∕進修別`),
等級別=as.factor(等級別),
一年級男生=parse_number(一年級男生,locale=cl),
一年級女生=parse_number(一年級女生,locale=cl),
一年級男生每千人=一年級男生/1000,
一年級女生每千人=一年級女生/1000
) -> collegeData6
head(collegeData5[,c("一年級男生","一年級女生","一年級男生每千人","一年級女生每千人")])
head(collegeData6[,c("一年級男生","一年級女生","一年級男生每千人","一年級女生每千人")])
6.3.3 Apply to multiple variables
If you want to mutate a group of variables you can use
- mutate_all(.data, .funs): for all variables 對所有變數做.funs的動作。.funs可使用
funs()
函數,使用範例如下:
範例 一個函數
library(stringr)
college1stYearData %>%
mutate_all(funs(parse_number(.,locale=cl))) ->
college1stYearDataNumeric
範例 二個函數
college1stYearDataNumeric %>%
mutate_all(
funs(
每百人=./100,
每千人=./1000
)
)
執行以下程式:
library(googlesheets)
gs_auth(new_user = TRUE)
gsSurvey<-gs_key("1mC9bnxj11NCNoOCw0Vmn4nxERbHtLjeGo9v9C9b2GDE") #download sheet information as an object
classSurvey<-gs_read(gsSurvey,ws=1) #access the 1st worksheet
classSurvey %>% select(居住地行政區,性別) ->
classSurveyPartial
將classSurveyPartial的所有變數轉成factor class。
- mutate_at(.data, .vars, .funs): for some variables 對.vars所定義選出的變數做.funs的動作, 其中.vars可透過
vars()
來定義。
範例 直接使用變數名稱:
collegeData %>% mutate_at(
vars(一年級男生,一年級女生),
funs(parse_number(.,locale=cl))
) -> college1stYearDataA
college1stYearDataA %>%
select(一年級男生,一年級女生)
vars()
can work with the following select helper functions:
使用文字特徵:
starts_with():以什麼文字開頭
ends_with():以什麼文字結尾
contains():包含什麼文字
matches():其他複雜可透過regular expression表示的文字規則使用數字特徵:想選出變數x001,x002,…,x083
num_range():num_range("x",1:83)
範例 利用變數名稱規則
collegeData %>% mutate_at(
vars(starts_with("一年級")),
funs(parse_number(.,locale=cl))
) -> college1stYearDataB
college1stYearDataB %>%
select(一年級男生,一年級女生)
將原始collegeData資料裡的所有男/女生變數都轉成numeric class。