第 2 章 Data cleaning

2.1 Data frame

  • A special type of list:

    • each element MUST be an atomic vector and have the same length (aka number of observations). (atomic vector requirement can be dropped in an extended data frame class, tibble)
  • More extraction methods and dplyr provides many useful tools.

  • It is still a list. So you can use purrr package as well.

2.2 Extraction

  • All $, [[]], [] apply to data frame as apply to list.

  • Rectangular object, such as data frame and matrix, has one more extraction tool [row_index, column_index]

2.3 dplyr

Probably the most widely used package in R. It has a key feature.

  • All dplyr function take data frame as its first argument, and return a data frame.

  • In dplyr function, operations on variables just express as operation on variables directly. No need to do extraction from data frame.

Example

DF=data.frame(
  y=c(1,3,5),
  x=c(0,-1,1)
)

We want to add z to DF where \[z=y+x^2,\] and then take mean of z.

# without dplyr
{
  DF$z=DF$y+(DF$x)^2
  
  mean(DF$z)
}

# with dplyr
{
  mutate(DF, z=y+x^2) -> DF
  summarise(DF, meanz=mean(z)) -> DF_meanz
}

# Using pipe operator
{
  DF %>%
    mutate(z=y+x^2) %>%
    summarise(meanz=mean(z))
}

With DF to be its first arg and return, you can use %>% continuously until you don’t need dplyr’s assistance.

可以不需要從母物件Extraction就可以對子元素運算還有:

  • with(data, expression): 使用data裡的元素去完成expression.

  • within(data, expression): 使用data裡的元素去完成expression, expression裡新創的物件都會加到data裡

with(DF,{
  w=y-x^2
  w # 沒這行便沒有回傳值
  } )

within(DF, 
       {
         w=y-x^2
         q=3*y+w
         })
  • 兩者也可以用在list.

2.3.1 YouBike data

browseURL("https://data.ntpc.gov.tw/datasets?keyword=%E6%96%B0%E5%8C%97%E5%B8%82%E5%85%AC%E5%85%B1%E8%87%AA%E8%A1%8C%E8%BB%8A%E7%A7%9F%E8%B3%83%E7%B3%BB%E7%B5%B1%28YouBike%29")
targetUrl <- "https://data.ntpc.gov.tw/api/datasets/71CD1490-A2DF-4198-BEF1-318479775E8A/json"
library(jsonlite)
list_newTaipeiYouBike <- fromJSON(
  targetUrl, simplifyDataFrame = F) # list class
df_newTaipeiYouBike <- fromJSON(
  targetUrl) # data.frame class
list_newTaipeiYouBike %>% typeof()
df_newTaipeiYouBike %>% typeof()
list_newTaipeiYouBike %>% class()
df_newTaipeiYouBike %>% class()
methods(class="list")
methods(class="data.frame")

2.4 Check data (mutate/transmute)

2.4.1 Class

  • Proper class will make your data exploration much easier, no matter visually or not.

sno(站點代號)、sna(中文場站名稱)、tot(場站總停車格)、sbi(可借車位數)、sarea(中文場站區域)、mday(資料更新時間)、lat(緯度)、lng(經度)、ar(中文地址)、sareaen(英文場站區域)、snaen(英文場站名稱)、aren(英文地址)、bemp(可還空位數)、act(場站是否暫停營運)

df_newTaipeiYouBike %>%
  str()

2.4.2 mutate/transmute

df_newTaipeiYouBike %>%
  mutate(
    tot=as.integer(tot),
    sbi=as.integer(sbi),
    bemp=as.integer(bemp),
    車位落差=tot-(sbi+bemp), # 創立新變數
    有落差=(車位落差!=0) # 前面出現過的可立刻拿來運算
  ) -> df_newTaipeiYouBike2

df_newTaipeiYouBike %>%
  transmute( # 只留下這裡的變數
    tot=as.integer(tot),
    sbi=as.integer(sbi),
    bemp=as.integer(bemp),
    車位落差=tot-(sbi+bemp), # 創立新變數
    有落差=(車位落差!=0) # 前面出現過的可立刻拿來運算
  ) -> df_newTaipeiYouBike3

檢視

{
  View(df_newTaipeiYouBike2)
  View(df_newTaipeiYouBike3)
}

2.5 Block programming (summarise/across)

Use {} to turn a multiple-steps task into a block.

taskName = {
  ...
}
  1. The task can be re-used easily when we talk about drake.

  2. The name can also represent a valued object. The value of that object will be the LAST LINE produced value ONLY IF that line will show value on your screen.

  • Useful when you later want to turn it into function body, a plan, etc.
viewDataFrames = { 
  View(df_newTaipeiYouBike2)
  View(df_newTaipeiYouBike3)
}
viewDataFrames
viewDataFrames = { 
  View(df_newTaipeiYouBike2)
  View(df_newTaipeiYouBike3)
  df_newTaipeiYouBike2
}
viewDataFrames

Exercise 2.1 請將df_newTaipeiYouBike所有的變數改成合理的class。

2.5.1 Missing value

  • What is the symbol for missing value? n/a, -9999, etc.

    • Convert those symbols into NAs in R.
  • How many NA? Percentage in sample?

2.5.2 summarise

df_newTaipeiYouBike %>%
  summarise(
    sna=sum(is.na(sna)),
    sno=sum(is.na(sno))
  )

# Use %>% to observe formula body
df_newTaipeiYouBike %>%
  summarise(
    sna=sna %>% {sum(is.na(.))},
    sno=sno %>% {sum(is.na(.))}
  )

2.5.3 across

across(.cols, .fns) tidy-select:

Arguments:

  • .cols follow tidy-select, such as

    • 針對變數varm到varq間欄位: .cols=varm:varq selects all variables from varm to varq.

    • 針對所有欄位:.cols=everythins()

    • 針對AAA及BBB兩個欄位: .cols=all_of(c("AAA","BBB"))

    • 針對AAA及BBB任何欄位:.cols=any_of(c("AAA","BBB")) , 用在不確定哪個欄位會在,但有在的話就針對它; 相對於all_of,any_of就算寫到不存在的欄位也不會有error。

    • 針對is.numeric函數回傳為T的欄位: .cols=where(is.numeric) .

    • 針對欄位名稱有area字眼的欄位:.cols=contains("area")

    • 針對橍位名稱滿足regex “a$”的欄位: .cols=matches("a\$")

data01$df_newTaipeiYouBike ->df_newTaipeiYouBike
names(df_newTaipeiYouBike)
df_newTaipeiYouBike %>%
  transmute(
    across(
      .cols=tot:mday,
      .fns = ~.x
    )
  )

df_newTaipeiYouBike %>%
  transmute(
    across(
      .cols=any_of(c("aaa", "sna")),
      .fns = ~.x
    )
  ) 
  • 選出來的欄位可進一步

    • & 交集,| 連集, ! 補集合, c() 多群集合。
df_newTaipeiYouBike %>%
  transmute(
    across(
      .cols= contains("a") & sbi:sareaen,
      .fns= ~.x
    )
  )
# summarise(., across(selector, functions))
df_newTaipeiYouBike %>%
  summarise(
    across(
      everything(),
      ~{sum(is.na(.x))}
    )
  )

2.5.4 應用

sno(站點代號)、sna(中文場站名稱)、tot(場站總停車格)、sbi(可借車位數)、sarea(中文場站區域)、mday(資料更新時間)、lat(緯度)、lng(經度)、ar(中文地址)、sareaen(英文場站區域)、snaen(英文場站名稱)、aren(英文地址)、bemp(可還空位數)、act(場站是否暫停營運)

修正class

youBikeRightClass = { # deal with class
  df_newTaipeiYouBike %>% map(class)
  
  numericVs <- c("tot","sbi","lat","lng")
  
  df_newTaipeiYouBike %>% 
    mutate(
      across(
        all_of(numericVs),
        as.numeric
      )
    ) -> df_newTaipeiYouBike2
  
  factorVs <- c("sarea")
  df_newTaipeiYouBike2 %>%
    mutate(
      across(
        all_of(factorVs),
        as.factor
      )
    ) -> df_newTaipeiYouBike3
  
  df_newTaipeiYouBike3 %>%
    mutate(
      mday=ymd_hms(mday)
    )
}

2.5.4.1 檢查NA

checkNA={
  totNA <- function(.x) sum(is.na(.x))
  smplSize <- function(.x) length(.x) - totNA(.x)
  youBikeRightClass %>%
    summarise(
      across(
        everything(),
        list(na=totNA, size=smplSize) # na, size會成為欄位名稱一部份
      )
    )
}

Exercise 2.2 請創造一個函數:mutate_class(df, numericVs, factorVs)

  • df: A data frame.
  • numericVs: A character vector of variable names that should be numeric class
  • factorVs: A character vector of variable names that should be factor class

Return: A data frame with correct class setting

If your source provides a data frame structure, use Import Dataset > From Text(readr) to import your data.

2.6 勞動統計查詢網📊

Our gov’t uses xlsx format a lot. However, sometimes you will encounter some error. An easy way to solve it is to upload to Google Sheets which read almost everything with accuracy. Then use googlesheets4 package to import.

library(googlesheets4)

read_sheet(
  "https://docs.google.com/spreadsheets/d/1PPWeFGqedVsgZmV81MeA0oJDCBkj5upkHltlWap7xUY/edit#gid=1835013852", 
  skip=3
) -> startSalary
  • If not working, in Console, try
gs4_deauth()
gs4_auth()

One challenge to data science project is the balance between data exploration and data I/O. econDV::%=% will check if the object already there before import (and then stored in R.cache).

library(econDV)
startSalary %=% read_sheet(
  "https://docs.google.com/spreadsheets/d/1PPWeFGqedVsgZmV81MeA0oJDCBkj5upkHltlWap7xUY/edit#gid=1835013852", 
  skip=3
)

2.7 Parsing functions

  • Though as.{class} functions are good,readr::parse_... functions are even better. If something goes wrong, it gives you clear information.

readr package has several parse_… function to parse characters into different types of vectors:

  • parse_double, parse_character, … , etc

Arguments:

  • na=c(....) defines what counts as NA.

  • locale=locale(...) where locale(...) creates an object that defines and interprets the string accordingly.

    • locale(decimal_mark=".", grouping_mark=","): 小數點為.,千分位點為, 。
price <- c("$10.000,23", "$999,15", "n/a")
parse_number(price,
             na="n/a",
             locale=locale(decimal_mark = ",", grouping_mark = "."))

範例:初任人員薪資

{
  startSalary %>% map(class)
  
  # 以字串形式觀察原始資料以理解na註記形式
  startSalary %>%
    map(
      ~parse_character(as.character(.x))
    )

  # 去除34-36行
  startSalary %>%
    slice(
      1:33
    ) -> startSalary2
  
  # 設定class
  startSalary2 %>%
    mutate(
      across(
        `高級中等(高中、高職)`:大學,
        ~parse_number(as.character(.x), na="...")
      )
    ) -> startSalary3
  
  # 改變數名稱
  startSalary3 %>%
    rename(
      year=`...1`
    )
}

Exercise 2.3 前述區塊有許多中間物件可以透過%>%串接而省略,請重寫該區塊並盡量減少中間物件。

  1. 將startSalary處理成:
  • startSalary$year: integer. 西元年
  • 其餘: double
  1. 計算NA每個變數有多少,佔比多少。(summarise(across(...))

2.8 group_by

{
  # 全部有多少個車位
  youBikeRightClass %>%
    summarise(
      tot=sum(tot)) -> totAll
  
  # 各行政區有幾個車位
  youBikeRightClass %>%
    group_by(sarea) %>%
    summarise(
      tot=sum(tot)
      ) -> totByArea

  # 那個行政區最多/最少
  totByArea %>%
    arrange(tot) 
  totByArea %>%
    arrange(desc(tot))
  # 前三名
  totByArea %>%
    slice_max(tot, n=3)
  # 前50%
  totByArea %>%
    slice_max(tot, prop=0.5)
  # 後三名
  totByArea %>%
    slice_min(tot, n=3)
  
}

Exercise 2.4 找出每個行政區車位(tot)最多的前二站(sna)

youBikeRightClass %>%
  group_by(sarea) %>%
  slice_max(tot, n=2) -> top2eachArea

要小心group_by後接著的運算都是分群的結果,除非下了ungroup()指令:

is.grouped_df(top2eachArea)
{
  # 各區前二名總車位數
  top2eachArea %>%
    summarise(
      tot=sum(tot)
    )
  
  # 各區前二名跨區全部加總
  top2eachArea %>%
    ungroup() %>%
    summarise(
      tot=sum(tot)
    )
  
}

2.9 filter

{
  # 找出總車位數大於20的樣本
  youBikeRightClass %>%
    filter(
      tot > 20
    )
  
  # 找出總車位數大於20的樣本
  youBikeRightClass %>%
    filter(
      tot > 20 & sarea=="汐止區"
    )
  
  youBikeRightClass %>%
    filter(
      tot > 20 , sarea=="汐止區" # 或,分隔
    )

}

2.10 select

{
  youBikeRightClass %>% 
    select(sno:tot)
  youBikeRightClass %>% 
    select(1:3)
  youBikeRightClass %>% 
    select(contains("a"))
  youBikeRightClass %>% 
      select(matches("a$"))
}

2.11 WDI database📊

2.11.1 WDI package

library(WDI)
WDIsearch(string="Inflation") -> inflation_search
inflation_search

“Inflation, consumer prices (annual %)”

target_series <- inflation_search[1,]
WDI(
  indicator = target_series[["indicator"]]
) -> world_inflation

2.11.2 ISO country code

econDV::ISOcountry

2.12 臺灣政府統計資訊網📊

2.13 臺灣政府資料開放平台📊

2.13.1 XML data

library(xml2)
read_xml("https://www.dgbas.gov.tw/public/data/open/Stat/price/PR0101A1M.xml") -> cpiTW

XML is a special data storage format. To extract data from the file, you need to know XPath. Fortunately, in R everything can be as_list.

xml2::as_list(cpiTW) -> cpiTW
cpiTW[["DataSet"]][[1]] %>%
  unlist()

cpiTW[["DataSet"]] %>%
  map_dfr(
    unlist
  ) -> df_cpiTW

Exercise 2.5 請將df_cpiTW清理成econDV::data01$df_inflationTW形式

econDV::data01$df_inflationTW

2.14 Combine tables (bind/join)

  • Bind: bind_rows(...), bind_cols(...)

  • Join: ..._join(...)

2.14.1 Bind tables

  • bind_rows(...), bind_cols(...)
world_inflationComplete = {
  world_inflation %>%
    rename(
      inflation=FP.CPI.TOTL.ZG
    ) -> world_inflation2
  
  bind_rows(
    data01$df_inflationTW,
    world_inflation2
  )
}

2.14.2 Join tables

  • ..._join(...)
world_inflationWithTaiwan = {
  world_inflation2 %>%
    left_join(
      ISOcountry %>%
        select(-en), # 不選en
      by = c("iso2c" = "code")
    ) -> world_inflationWithTaiwan
  world_inflationWithTaiwan
}
  • 有多個合併key,
by=c("keyX1"="KeyY1", ..., "keyXn"="keyYn")
  • key同名時,
by=c("key1", "key2")

2.15 Pivot

  • tidyr package

2.15.1 pivot_wider

world_inflationComplete_wide = {
  world_inflationComplete %>%
    pivot_wider(
      id_cols = c("year", "country", "inflation"),
      names_from = "country",
      values_from = "inflation"
    ) ->
  world_inflationComplete_wide
  world_inflationComplete_wide
}

2.15.2 pivot_longer

world_inflationComplete_wide %>% names()

2.16 Options: scipen, digits

Scientific notation:

  • 1e05 means \(10^5\); 1e03 means \(10^3\); 1e-04 means \(10^{-4}\).
head(world_inflationComplete_wide)

turn off scientific notation

options(scipen=999) # turn off scientific notation
head(world_inflationComplete_wide)
options(digits=4) # decimal digits = 4
head(world_inflationComplete_wide)

turn on scientific notation

options(scipen=0) # turn on scientific notation
head(world_inflationComplete_wide)

2.17 Meta data

  • 使用attr(obj, "屬性") <-來記錄資料出處、單位等。

2.18 Drake plan📋

  • Drake plan of the chapter comes with econDV package

  • The following code make the plan happen

{
  library(googlesheets4)
  library(econDV)

  startSalary %=% read_sheet(
    "https://docs.google.com/spreadsheets/d/1PPWeFGqedVsgZmV81MeA0oJDCBkj5upkHltlWap7xUY/edit#gid=1835013852",
    skip = 3
  )

  world_inflation %=% WDI::WDI(
    indicator = "FP.CPI.TOTL.ZG"
  )

  df_cpiTW %=% {
    con <- url("https://www.dropbox.com/s/8585yr3t0bbcb5b/df_cpiTW.rda?dl=1")
    load(file = con)
    df_cpiTW
  }

  df_newTaipeiYouBike %=% jsonlite::fromJSON(
    "https://data.ntpc.gov.tw/api/datasets/71CD1490-A2DF-4198-BEF1-318479775E8A/json"
  ) # data.frame class

  drake::make(plan_dataCleaning)
}