第 2 章 Data cleaning
2.1 Data frame
A special type of list:
- each element MUST be an atomic vector and have the same length (aka number of observations). (atomic vector requirement can be dropped in an extended data frame class, tibble)
More extraction methods and dplyr provides many useful tools.
It is still a list. So you can use purrr package as well.
2.2 Extraction
All
$
,[[]]
,[]
apply to data frame as apply to list.Rectangular object, such as data frame and matrix, has one more extraction tool
[row_index, column_index]
2.3 dplyr
Probably the most widely used package in R. It has a key feature.
All dplyr function take data frame as its first argument, and return a data frame.
In dplyr function, operations on variables just express as operation on variables directly. No need to do extraction from data frame.
Example
=data.frame(
DFy=c(1,3,5),
x=c(0,-1,1)
)
We want to add z to DF where \[z=y+x^2,\] and then take mean of z.
# without dplyr
{$z=DF$y+(DF$x)^2
DF
mean(DF$z)
}
# with dplyr
{mutate(DF, z=y+x^2) -> DF
summarise(DF, meanz=mean(z)) -> DF_meanz
}
# Using pipe operator
{%>%
DF mutate(z=y+x^2) %>%
summarise(meanz=mean(z))
}
With DF to be its first arg and return, you can use %>%
continuously until you don’t need dplyr’s assistance.
可以不需要從母物件Extraction就可以對子元素運算還有:
with(data, expression)
: 使用data裡的元素去完成expression.within(data, expression)
: 使用data裡的元素去完成expression, expression裡新創的物件都會加到data裡
with(DF,{
=y-x^2
w# 沒這行便沒有回傳值
w
} )
within(DF,
{=y-x^2
w=3*y+w
q })
- 兩者也可以用在list.
2.3.1 YouBike data
browseURL("https://data.ntpc.gov.tw/datasets?keyword=%E6%96%B0%E5%8C%97%E5%B8%82%E5%85%AC%E5%85%B1%E8%87%AA%E8%A1%8C%E8%BB%8A%E7%A7%9F%E8%B3%83%E7%B3%BB%E7%B5%B1%28YouBike%29")
<- "https://data.ntpc.gov.tw/api/datasets/71CD1490-A2DF-4198-BEF1-318479775E8A/json"
targetUrl library(jsonlite)
<- fromJSON(
list_newTaipeiYouBike simplifyDataFrame = F) # list class
targetUrl, <- fromJSON(
df_newTaipeiYouBike # data.frame class targetUrl)
%>% typeof()
list_newTaipeiYouBike %>% typeof() df_newTaipeiYouBike
%>% class()
list_newTaipeiYouBike %>% class() df_newTaipeiYouBike
methods(class="list")
methods(class="data.frame")
2.4 Check data (mutate/transmute)
2.4.1 Class
- Proper class will make your data exploration much easier, no matter visually or not.
sno(站點代號)、sna(中文場站名稱)、tot(場站總停車格)、sbi(可借車位數)、sarea(中文場站區域)、mday(資料更新時間)、lat(緯度)、lng(經度)、ar(中文地址)、sareaen(英文場站區域)、snaen(英文場站名稱)、aren(英文地址)、bemp(可還空位數)、act(場站是否暫停營運)
%>%
df_newTaipeiYouBike str()
2.4.2 mutate/transmute
%>%
df_newTaipeiYouBike mutate(
tot=as.integer(tot),
sbi=as.integer(sbi),
bemp=as.integer(bemp),
=tot-(sbi+bemp), # 創立新變數
車位落差=(車位落差!=0) # 前面出現過的可立刻拿來運算
有落差-> df_newTaipeiYouBike2
)
%>%
df_newTaipeiYouBike transmute( # 只留下這裡的變數
tot=as.integer(tot),
sbi=as.integer(sbi),
bemp=as.integer(bemp),
=tot-(sbi+bemp), # 創立新變數
車位落差=(車位落差!=0) # 前面出現過的可立刻拿來運算
有落差-> df_newTaipeiYouBike3 )
檢視
{View(df_newTaipeiYouBike2)
View(df_newTaipeiYouBike3)
}
2.5 Block programming (summarise/across)
Use {}
to turn a multiple-steps task into a block.
Give you a better sense of what you are doing

= {
taskName
... }
The task can be re-used easily when we talk about drake.
The name can also represent a valued object. The value of that object will be the LAST LINE produced value ONLY IF that line will show value on your screen.
- Useful when you later want to turn it into function body, a plan, etc.
= {
viewDataFrames View(df_newTaipeiYouBike2)
View(df_newTaipeiYouBike3)
} viewDataFrames
= {
viewDataFrames View(df_newTaipeiYouBike2)
View(df_newTaipeiYouBike3)
df_newTaipeiYouBike2
} viewDataFrames
Exercise 2.1 請將df_newTaipeiYouBike
所有的變數改成合理的class。
2.5.1 Missing value
What is the symbol for missing value? n/a, -9999, etc.
- Convert those symbols into NAs in R.
How many NA? Percentage in sample?
2.5.2 summarise
%>%
df_newTaipeiYouBike summarise(
sna=sum(is.na(sna)),
sno=sum(is.na(sno))
)
# Use %>% to observe formula body
%>%
df_newTaipeiYouBike summarise(
sna=sna %>% {sum(is.na(.))},
sno=sno %>% {sum(is.na(.))}
)
2.5.3 across
across(.cols, .fns)
tidy-select:
Arguments:
.cols
follow tidy-select, such as針對變數varm到varq間欄位:
.cols=varm:varq
selects all variables from varm to varq.針對所有欄位:
.cols=everythins()
針對AAA及BBB兩個欄位:
.cols=all_of(c("AAA","BBB"))
針對AAA及BBB任何欄位:
.cols=any_of(c("AAA","BBB"))
, 用在不確定哪個欄位會在,但有在的話就針對它; 相對於all_of,any_of就算寫到不存在的欄位也不會有error。針對is.numeric函數回傳為T的欄位:
.cols=where(is.numeric)
.針對欄位名稱有area字眼的欄位:
.cols=contains("area")
針對橍位名稱滿足regex “a$”的欄位:
.cols=matches("a\$")
$df_newTaipeiYouBike ->df_newTaipeiYouBike
data01names(df_newTaipeiYouBike)
%>%
df_newTaipeiYouBike transmute(
across(
.cols=tot:mday,
.fns = ~.x
)
)
%>%
df_newTaipeiYouBike transmute(
across(
.cols=any_of(c("aaa", "sna")),
.fns = ~.x
) )
選出來的欄位可進一步
&
交集,|
連集,!
補集合,c()
多群集合。
%>%
df_newTaipeiYouBike transmute(
across(
.cols= contains("a") & sbi:sareaen,
.fns= ~.x
) )
# summarise(., across(selector, functions))
%>%
df_newTaipeiYouBike summarise(
across(
everything(),
~{sum(is.na(.x))}
) )
2.5.4 應用
sno(站點代號)、sna(中文場站名稱)、tot(場站總停車格)、sbi(可借車位數)、sarea(中文場站區域)、mday(資料更新時間)、lat(緯度)、lng(經度)、ar(中文地址)、sareaen(英文場站區域)、snaen(英文場站名稱)、aren(英文地址)、bemp(可還空位數)、act(場站是否暫停營運)
修正class
= { # deal with class
youBikeRightClass %>% map(class)
df_newTaipeiYouBike
<- c("tot","sbi","lat","lng")
numericVs
%>%
df_newTaipeiYouBike mutate(
across(
all_of(numericVs),
as.numeric
)-> df_newTaipeiYouBike2
)
<- c("sarea")
factorVs %>%
df_newTaipeiYouBike2 mutate(
across(
all_of(factorVs),
as.factor
)-> df_newTaipeiYouBike3
)
%>%
df_newTaipeiYouBike3 mutate(
mday=ymd_hms(mday)
) }
2.5.4.1 檢查NA
={
checkNA<- function(.x) sum(is.na(.x))
totNA <- function(.x) length(.x) - totNA(.x)
smplSize %>%
youBikeRightClass summarise(
across(
everything(),
list(na=totNA, size=smplSize) # na, size會成為欄位名稱一部份
)
) }
Exercise 2.2 請創造一個函數:mutate_class(df, numericVs, factorVs)
- df: A data frame.
- numericVs: A character vector of variable names that should be numeric class
- factorVs: A character vector of variable names that should be factor class
Return: A data frame with correct class setting
If your source provides a data frame structure, use Import Dataset > From Text(readr) to import your data.
2.6 勞動統計查詢網📊
Our gov’t uses xlsx format a lot. However, sometimes you will encounter some error. An easy way to solve it is to upload to Google Sheets which read almost everything with accuracy. Then use googlesheets4 package to import.
library(googlesheets4)
read_sheet(
"https://docs.google.com/spreadsheets/d/1PPWeFGqedVsgZmV81MeA0oJDCBkj5upkHltlWap7xUY/edit#gid=1835013852",
skip=3
-> startSalary )
- If not working, in Console, try
gs4_deauth()
gs4_auth()
One challenge to data science project is the balance between data exploration and data I/O. econDV::%=%
will check if the object already there before import (and then stored in R.cache).
library(econDV)
%=% read_sheet(
startSalary "https://docs.google.com/spreadsheets/d/1PPWeFGqedVsgZmV81MeA0oJDCBkj5upkHltlWap7xUY/edit#gid=1835013852",
skip=3
)
2.7 Parsing functions
- Though
as.{class}
functions are good,readr::parse_...
functions are even better. If something goes wrong, it gives you clear information.
readr package has several parse_… function to parse characters into different types of vectors:
parse_double
,parse_character
, … , etc
Arguments:
na=c(....)
defines what counts as NA.locale=locale(...)
wherelocale(...)
creates an object that defines and interprets the string accordingly.locale(decimal_mark=".", grouping_mark=",")
: 小數點為.,千分位點為, 。
<- c("$10.000,23", "$999,15", "n/a")
price parse_number(price,
na="n/a",
locale=locale(decimal_mark = ",", grouping_mark = "."))
範例:初任人員薪資
{%>% map(class)
startSalary
# 以字串形式觀察原始資料以理解na註記形式
%>%
startSalary map(
~parse_character(as.character(.x))
)
# 去除34-36行
%>%
startSalary slice(
1:33
-> startSalary2
)
# 設定class
%>%
startSalary2 mutate(
across(
`高級中等(高中、高職)`:大學,
~parse_number(as.character(.x), na="...")
)-> startSalary3
)
# 改變數名稱
%>%
startSalary3 rename(
year=`...1`
) }
Exercise 2.3 前述區塊有許多中間物件可以透過%>%
串接而省略,請重寫該區塊並盡量減少中間物件。
- 將startSalary處理成:
- startSalary$year: integer. 西元年
- 其餘: double
- 計算NA每個變數有多少,佔比多少。(
summarise(across(...))
)
2.8 group_by
{# 全部有多少個車位
%>%
youBikeRightClass summarise(
tot=sum(tot)) -> totAll
# 各行政區有幾個車位
%>%
youBikeRightClass group_by(sarea) %>%
summarise(
tot=sum(tot)
-> totByArea
)
# 那個行政區最多/最少
%>%
totByArea arrange(tot)
%>%
totByArea arrange(desc(tot))
# 前三名
%>%
totByArea slice_max(tot, n=3)
# 前50%
%>%
totByArea slice_max(tot, prop=0.5)
# 後三名
%>%
totByArea slice_min(tot, n=3)
}
Exercise 2.4 找出每個行政區車位(tot)最多的前二站(sna)
%>%
youBikeRightClass group_by(sarea) %>%
slice_max(tot, n=2) -> top2eachArea
要小心group_by後接著的運算都是分群的結果,除非下了ungroup()指令:
is.grouped_df(top2eachArea)
{# 各區前二名總車位數
%>%
top2eachArea summarise(
tot=sum(tot)
)
# 各區前二名跨區全部加總
%>%
top2eachArea ungroup() %>%
summarise(
tot=sum(tot)
)
}
2.9 filter
{# 找出總車位數大於20的樣本
%>%
youBikeRightClass filter(
> 20
tot
)
# 找出總車位數大於20的樣本
%>%
youBikeRightClass filter(
> 20 & sarea=="汐止區"
tot
)
%>%
youBikeRightClass filter(
> 20 , sarea=="汐止區" # 或,分隔
tot
)
}
2.10 select
{%>%
youBikeRightClass select(sno:tot)
%>%
youBikeRightClass select(1:3)
%>%
youBikeRightClass select(contains("a"))
%>%
youBikeRightClass select(matches("a$"))
}
2.11 WDI database📊
World Development Indicator
2.11.1 WDI package
library(WDI)
WDIsearch(string="Inflation") -> inflation_search
inflation_search
“Inflation, consumer prices (annual %)”
<- inflation_search[1,]
target_series WDI(
indicator = target_series[["indicator"]]
-> world_inflation )
2.12 臺灣政府統計資訊網📊
2.13 臺灣政府資料開放平台📊
消費者物價基本分類指數: https://data.gov.tw/dataset/6019
2.13.1 XML data
library(xml2)
read_xml("https://www.dgbas.gov.tw/public/data/open/Stat/price/PR0101A1M.xml") -> cpiTW
XML is a special data storage format. To extract data from the file, you need to know XPath. Fortunately, in R everything can be as_list
.
::as_list(cpiTW) -> cpiTW xml2
"DataSet"]][[1]] %>%
cpiTW[[unlist()
"DataSet"]] %>%
cpiTW[[map_dfr(
unlist-> df_cpiTW )
Exercise 2.5 請將df_cpiTW
清理成econDV::data01$df_inflationTW
形式
::data01$df_inflationTW econDV
2.14 Combine tables (bind/join)
Bind:
bind_rows(...)
,bind_cols(...)
Join:
..._join(...)
2.14.1 Bind tables
bind_rows(...)
,bind_cols(...)
= {
world_inflationComplete %>%
world_inflation rename(
inflation=FP.CPI.TOTL.ZG
-> world_inflation2
)
bind_rows(
$df_inflationTW,
data01
world_inflation2
) }
2.14.2 Join tables
..._join(...)
= {
world_inflationWithTaiwan %>%
world_inflation2 left_join(
%>%
ISOcountry select(-en), # 不選en
by = c("iso2c" = "code")
-> world_inflationWithTaiwan
)
world_inflationWithTaiwan }
- 有多個合併key,
by=c("keyX1"="KeyY1", ..., "keyXn"="keyYn")
- key同名時,
by=c("key1", "key2")
2.15 Pivot
- tidyr package
2.15.1 pivot_wider
= {
world_inflationComplete_wide %>%
world_inflationComplete pivot_wider(
id_cols = c("year", "country", "inflation"),
names_from = "country",
values_from = "inflation"
->
)
world_inflationComplete_wide
world_inflationComplete_wide }
2.15.2 pivot_longer
%>% names() world_inflationComplete_wide
2.16 Options: scipen, digits
Scientific notation:
1e05
means \(10^5\);1e03
means \(10^3\);1e-04
means \(10^{-4}\).
head(world_inflationComplete_wide)
turn off scientific notation
options(scipen=999) # turn off scientific notation
head(world_inflationComplete_wide)
options(digits=4) # decimal digits = 4
head(world_inflationComplete_wide)
turn on scientific notation
options(scipen=0) # turn on scientific notation
head(world_inflationComplete_wide)
2.17 Meta data
- 使用
attr(obj, "屬性") <-
來記錄資料出處、單位等。
2.18 Drake plan📋
Drake plan of the chapter comes with econDV package
The following code make the plan happen
{library(googlesheets4)
library(econDV)
%=% read_sheet(
startSalary "https://docs.google.com/spreadsheets/d/1PPWeFGqedVsgZmV81MeA0oJDCBkj5upkHltlWap7xUY/edit#gid=1835013852",
skip = 3
)
%=% WDI::WDI(
world_inflation indicator = "FP.CPI.TOTL.ZG"
)
%=% {
df_cpiTW <- url("https://www.dropbox.com/s/8585yr3t0bbcb5b/df_cpiTW.rda?dl=1")
con load(file = con)
df_cpiTW
}
%=% jsonlite::fromJSON(
df_newTaipeiYouBike "https://data.ntpc.gov.tw/api/datasets/71CD1490-A2DF-4198-BEF1-318479775E8A/json"
# data.frame class
)
::make(plan_dataCleaning)
drake }