Chapter 4 R Basics

4.1 Main reference books

R for Data Science: abbreviated as RDS.
資料科學與R語言 (Data Science and the R Language) by 曾意儒: abbreviated as 曾意儒.

4.2 Numeric (vector)

a<-5
a2<-5L

aVector<-c(5,6,7)
a2Vector<-c(5L,6L,7L)

Use class() to check the class of each object above.
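As a quick check, the sketch below reruns the assignments above and inspects each object with class(). The comments reflect base R behavior: a bare number is stored as a double and reported as "numeric", while the L suffix requests an integer.

```r
a <- 5
a2 <- 5L
aVector <- c(5, 6, 7)
a2Vector <- c(5L, 6L, 7L)

class(a)         # "numeric"
class(a2)        # "integer"
class(aVector)   # "numeric"
class(a2Vector)  # "integer"

# Both kinds still count as numbers
is.numeric(a)    # TRUE
is.numeric(a2)   # TRUE
```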

Operations on numeric objects

曾意儒: 1.6.1-1.6.2
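The referenced sections are not reproduced here. As a minimal sketch of what operating on numeric vectors looks like in base R (the vectors x and y are made up for illustration), arithmetic is applied element by element:

```r
x <- c(5, 6, 7)
y <- c(1, 2, 3)

x + y     # 6 8 10   element-wise addition
x * 2     # 10 12 14 the scalar is recycled across x
x^2       # 25 36 49
x %/% 2   # 2 3 3    integer division
x %% 2    # 1 0 1    remainder

sum(x)    # 18
mean(x)   # 6
```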

4.3 Character/String (vector)

b<-"你好"

bVector<-c("你好","How are you?")

Use class() to check the class of each object above.

Understand function usage

help

Take class() as an example

class{base}

Documentation

Description
Usage
Arguments
Details
Examples

Typing ?class in the console window also works.

4.4 Factor

Factors are used to work with categorical variables, variables that have a fixed and known set of possible values.

RDS: Chapter 15
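To make the definition concrete, here is a small sketch using base R's factor() on a made-up character vector (sizes is a hypothetical example, not part of the course data). A factor stores each value as an integer code that points into its fixed set of levels.

```r
sizes <- c("small", "large", "medium", "small")
sizeFactor <- factor(sizes)

class(sizeFactor)       # "factor"
levels(sizeFactor)      # "large" "medium" "small" (sorted alphabetically by default)
as.integer(sizeFactor)  # 3 1 2 3, the position of each value in levels()

# A value outside the known set becomes NA
factor(c("small", "huge"), levels = c("small", "medium", "large"))
```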

Data source: undergraduate students admitted to the Department of Economics in academic years 98 through 105 (ROC calendar)

library(readr)
student <- read_csv("https://raw.githubusercontent.com/tpemartin/course-107-1-programming-for-data-science/master/data/student.csv")
library(dplyr)
library(magrittr)
student %<>% mutate(
  身高級距=cut(身高,c(0,150,155,160,165,170,175,180,185,200)))
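The cut() call above bins a numeric column into intervals and returns a factor. A minimal sketch on a few made-up heights (the values in heights are hypothetical) shows how the labels are formed with the same break points:

```r
heights <- c(148, 150, 158, 172, 190)
breaks <- c(0, 150, 155, 160, 165, 170, 175, 180, 185, 200)

heightGroup <- cut(heights, breaks)

class(heightGroup)         # "factor"
as.character(heightGroup)
# "(0,150]" "(0,150]" "(155,160]" "(170,175]" "(185,200]"
```

By default each interval is closed on the right, so a height of exactly 150 falls in (0,150] rather than (150,155].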

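Among 入學年 (year of admission), 出生地 (birthplace), 性別 (sex), 身高 (height), and 身高級距 (height bracket), which are categorical variables (that is, variables whose values denote categories)?

If the data had a 學號 (student ID) variable, what class do you think it should be?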

Change class

as.numeric(), as.character(), as.factor()

as.factor(student$出生地) -> student$出生地

Convert the remaining variables to appropriate classes.
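The conversion functions above can be tried on small made-up vectors first. The sketch below also shows one pitfall worth knowing: calling as.numeric() directly on a factor returns the internal level codes, not the numbers shown in the labels.

```r
x <- c("10", "20", "30")
as.numeric(x)               # 10 20 30
as.character(c(1.5, 2))     # "1.5" "2"

f <- as.factor(c("20", "10", "20"))
levels(f)                   # "10" "20"

as.numeric(f)               # 2 1 2, the level codes
as.numeric(as.character(f)) # 20 10 20, the values in the labels
```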

Factor has levels.

levels(student$出生地)

Count

table(student$出生地)
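For readers without the student data at hand, the same two calls can be tried on a toy factor. The vector birthplace below is made up for illustration and does not come from student.csv.

```r
birthplace <- factor(c("Taipei", "Tainan", "Taipei", "Kaohsiung", "Taipei"))

levels(birthplace)    # "Kaohsiung" "Tainan" "Taipei"
nlevels(birthplace)   # 3

counts <- table(birthplace)
counts                # one count per level
counts["Taipei"]      # Taipei appears 3 times
```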

Ordered factor

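Among 入學年 (year of admission), 出生地 (birthplace), 性別 (sex), 身高 (height), and 身高級距 (height bracket), which variables have categories that can be ordered (that is, compared as larger or smaller)?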

Use factor() to convert to ordered factor.

factor(student$身高級距,
       levels=c("(0,150]","(150,155]", "(155,160]", "(160,165]" ,"(165,170]", "(170,175]" ,"(175,180]" ,"(180,185]", "(185,200]"),
       ordered=TRUE) -> student$身高級距

You can also save the vector of level labels as a separate variable, say heightLevels, and then set levels=heightLevels.

heightLevels <- c("(0,150]","(150,155]", "(155,160]", "(160,165]" ,"(165,170]", "(170,175]" ,"(175,180]" ,"(180,185]", "(185,200]")
factor(student$身高級距,
       levels=heightLevels,
       ordered=TRUE) -> student$身高級距
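What ordering buys you can be seen on a shortened, made-up version of the height brackets (only three levels are used here to keep the sketch small): comparison operators and functions such as max() work on an ordered factor, following the order given in levels.

```r
someLevels <- c("(150,155]", "(155,160]", "(160,165]")
h <- factor(c("(155,160]", "(150,155]", "(160,165]"),
            levels = someLevels,
            ordered = TRUE)

class(h)              # "ordered" "factor"
h[1] > h[2]           # TRUE: (155,160] ranks above (150,155]
as.character(max(h))  # "(160,165]"
h < "(160,165]"       # TRUE TRUE FALSE
```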

4.5 Date and Time

References:

a<-"2017-01-31"

What is the class of a?

Date and date-time classes require a parsing function so that R can interpret what the text means.

Two different classes

date-time: “2017-11-28 12:00:00”

the number of seconds since 1970-01-01 00:00:00 UTC. (aka POSIXct)

date: “2017-11-28”

the number of days since 1970-01-01.
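The two storage rules can be verified with base R alone. In this sketch the dates are chosen close to the origin so the numbers are easy to check by hand; as.Date() and as.POSIXct() are the base R parsers, used here instead of lubridate only to keep the example free of extra packages.

```r
d <- as.Date("1970-01-10")
class(d)        # "Date"
as.numeric(d)   # 9 days after 1970-01-01

dt <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
class(dt)       # "POSIXct" "POSIXt"
as.numeric(dt)  # 86400 seconds = 24 * 60 * 60, one day after the origin
```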

package: lubridate

library(dplyr)
library(lubridate)
a <- ymd("2017-01-31") 
b <- ymd_hms("2017-01-31 20:11:59")

What are the classes of a and b?

The date-time class in R is POSIXct.

ymd() and ymd_hms() automatically guess the separators between the Gregorian year, month, day, and time components.

a <- ymd("2017/01/31") 
b <- ymd_hms("2017-01-31 2:53:00pm")

Try looking up the help page for ymd().
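A small sketch of this guessing behavior, assuming the lubridate package is installed: the same date written with different separators parses to the same Date value, and ymd_hms() returns a POSIXct value in UTC unless a time zone is supplied.

```r
library(lubridate)

d1 <- ymd("2017-01-31")
d2 <- ymd("2017/01/31")
d3 <- ymd("20170131")

class(d1)    # "Date"
d1 == d2     # TRUE
d2 == d3     # TRUE

t1 <- ymd_hms("2017-01-31 20:11:59")
class(t1)    # "POSIXct" "POSIXt"
hour(t1)     # 20
tz(t1)       # "UTC", the default time zone of ymd_hms()
```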

Two ways to call functions

Method 1:

library(lubridate)
ymd("2017/01/31")

Method 2

lubridate::ymd("2017/01/31")

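Method 1 attaches all of lubridate's functions at once, and they stay available for the rest of the session. Method 2 does not attach the package; it uses only that one function on that one line.
Two packages can export functions with the same name (for example dplyr::select and raster::select); in that case Method 2 avoids ambiguity.

Run sessionInfo() in the console to see which packages are attached to the session so far.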

Generate a date sequence

a <- seq(ymd("2001-01-01"),ymd("2018-09-01"),by="month")
b <- seq(ymd("2001-01-01"),ymd("2018-09-01"),by="quarter")
c <- seq(ymd("2001-01-01"),ymd("2018-09-01"),by="year")
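To see what the by argument produces, the sketch below uses shorter ranges than the ones above so the results can be counted by hand. It assumes lubridate is installed; the names monthly, quarterly, and yearly are chosen for illustration.

```r
library(lubridate)

monthly   <- seq(ymd("2001-01-01"), ymd("2001-12-01"), by = "month")
quarterly <- seq(ymd("2001-01-01"), ymd("2001-12-01"), by = "quarter")
yearly    <- seq(ymd("2001-01-01"), ymd("2004-01-01"), by = "year")

length(monthly)    # 12
format(quarterly)  # "2001-01-01" "2001-04-01" "2001-07-01" "2001-10-01"
length(yearly)     # 4
class(monthly)     # "Date"
```

Descriptive names such as these also avoid reusing c, which is the name of the base function that builds vectors.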

4.6 Operation on Strings

Common string operations include subsetting, joining, and splitting. Here we cover only a few of them; we will learn more later.

Package: stringr

Subset

str_sub(),str_subset(),str_extract(),str_match()

Type ?str_sub in the console to open the function's help page, then run its examples in a code chunk.
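Before reading the help pages, it may help to see the four functions side by side. This is a minimal sketch on made-up strings, assuming stringr is installed: str_sub() selects by position, str_subset() keeps the elements that match a pattern, str_extract() pulls out the matching part, and str_match() additionally returns the capture groups.

```r
library(stringr)

str_sub("abcdef", 2, 4)                             # "bcd"
str_subset(c("apple", "banana", "cherry"), "an")    # "banana"
str_extract(c("a1", "b22", "c"), "[0-9]+")          # "1" "22" NA

m <- str_match("2017-01", "([0-9]{4})-([0-9]{2})")
m[1, 1]   # "2017-01", the full match
m[1, 2]   # "2017", first capture group
m[1, 3]   # "01", second capture group
```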

Data: student ID numbers

studentID <- read_csv("https://raw.githubusercontent.com/tpemartin/github-data/master/studentID.csv")

Use str_sub() to extract each student's department.
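The positions to pass to str_sub() depend on how the IDs are laid out, which you need to check in studentID.csv yourself. The sketch below only illustrates the mechanics on hypothetical IDs, under the assumption (made up for this example) that characters 5 and 6 hold a department code.

```r
library(stringr)

# Hypothetical IDs; the real layout in studentID.csv may differ
ids <- c("410073001", "410083015")

str_sub(ids, 5, 6)    # "73" "83", characters 5 through 6 of each ID
str_sub(ids, -3, -1)  # "001" "015", negative positions count from the end
```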

Join/Split

str_c(),str_split_fixed()

Type ?str_c in the console to open the function's help page.
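As a short orientation before the exercises, this sketch (assuming stringr is installed, with made-up inputs) shows the two directions: str_c() joins strings element by element or collapses a vector into one string, and str_split_fixed() splits each string into a fixed number of pieces returned as a character matrix.

```r
library(stringr)

str_c("a", "b", sep = "-")               # "a-b"
str_c(c("x", "y"), 1:2)                  # "x1" "y2"
str_c(c("x", "y", "z"), collapse = "+")  # "x+y+z"

parts <- str_split_fixed(c("2017-01", "2018-12"), "-", 2)
dim(parts)    # 2 2, one row per input string
parts[, 1]    # "2017" "2018"
parts[, 2]    # "01" "12"
```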

Exercise 1

str_c(letters,LETTERS)
str_c(letters,LETTERS,sep="-")
str_c("lowercase: ", letters, ", capital: ", LETTERS)

Exercise 2

Data: the class's GitHub account data

library(readr)
githubData <- read_csv("https://raw.githubusercontent.com/tpemartin/github-data/master/githubData.csv")

githubData[c(2,3,4),] -> sampleGithub
str_c(sampleGithub$`GitHub username`,
      sampleGithub$`GitHub repo name`)
str_c("https://github.com/",
      sampleGithub$`GitHub username`,
      "/",
      sampleGithub$`GitHub repo name`)

Exercise 3

Data source: popularity voting results of the third 經濟播客競賽 (Economics Podcaster Competition)

library(readr)
filmVotingData <- read_csv("https://raw.githubusercontent.com/tpemartin/course-107-1-programming-for-data-science/master/data/%E7%AC%AC%E4%B8%89%E5%B1%86%E7%B6%93%E6%BF%9F%E6%92%AD%E5%AE%A2%E7%AB%B6%E8%B3%BD%E5%8F%83%E8%B3%BD%E4%BD%9C%E5%93%81%E6%8A%95%E7%A5%A8%E7%B5%90%E6%9E%9C%E6%A8%A3%E6%9C%AC%20-%20Sheet1.csv")

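A. In filmVotingData, create a variable that contains only the Gregorian year.

B. Each voter could pick up to two favorite films (some picked only one). How would you count how many people picked each film?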

4.7 Taiwan date-time

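Taiwanese data often record dates as ROC (民國) year and month. Because of how such data are recorded, year-month values often look like this when first imported: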

民國年月<-c("099/01","099/02","099/03")

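Convert the data above to the date class in Gregorian year-month form (ROC year + 1911 gives the Gregorian year). What is your programming strategy?

(Hint: for a variable of date class, adding years(k) moves the year forward by k years. You can also look up the usage of lubridate::years().)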
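The hint can be tried in isolation before tackling the exercise. This sketch, assuming lubridate is installed, only demonstrates adding years to an arbitrary date; it does not convert the ROC strings above, which is left as the exercise.

```r
library(lubridate)

d <- ymd("2010-01-01")

d + years(1)           # "2011-01-01"
d + years(5)           # "2015-01-01"
class(d + years(1))    # "Date", the result is still a date
```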

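4.8 Exercises

Online exercises

Online exercise URL: https://garylkl.shinyapps.io/Chapter4/

Homework

Go to the homework repo to get the homework:

How to download the homework repo

Do the following only once this semester

Go to GitHub.com and log in.
Go to the homework repo and run Fork: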

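On a school computer: do the following every time

Open GitHub Desktop.

Run Clone Repository:
Select the repo you just forked, then click Clone.

On your own computer: do the following only once

Follow the same steps as on a school computer, but you only need to do them once.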

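How to receive new homework

Provided you have completed the steps above, do the following every time to update your downloaded homework:

First open GitHub Desktop and confirm that you are logged in.

Click Fetch origin to check the latest state of the instructor's homework.
From the menu, choose Branch -> Merge Into Current Branch.
Select upstream/master as the update source, then click Merge into master to confirm.
Click Push origin to update your copy in the cloud.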