Chapter 5 Programming Design
5.1 For each
5.1.1 For each observation
<- list(
dataSet list(
gender="male", height=178, age=32,
children=list(
list(gender="male", age=5),
list(gender="male", age=0))
),list(
gender="female", height=166, age=30,
children=list(
list(gender="female", age=3))
),list(
gender="male", height=170, age=42,
children=list(
list(gender="male", age=10),
list(gender="female", age=8))
),list(
gender="male", height=188, age=22,
children=list()
), list(
gender="male", height=172, age=52,
children=list(
list(gender="female", age=25),
list(gender="female", age=23))
),list(
gender="female", height=160, age=42,
children=list(
list(gender="female", age=11))
) )
For each observation, how many children does each have?
<- list()
analysis $number_of_children <- list()
analysis
$number_of_children[[1]] <-length(dataSet[[1]]$children)
analysis$number_of_children[[2]] <-length(dataSet[[2]]$children)
analysis$number_of_children[[3]] <-length(dataSet[[3]]$children)
analysis$number_of_children[[4]] <-length(dataSet[[4]]$children)
analysis$number_of_children[[5]] <-length(dataSet[[5]]$children)
analysis$number_of_children[[6]] <-length(dataSet[[6]]$children) analysis
5.1.2 Information container
analysis$number_of_children
is an information container to contain all the pieces of information that you want. Here, we use declare-then-add method to save our task information. When declaring an object, only the storage type is determined, but not its length. In computer science,
- if the number of pieces of information (not each piece’s value) is already known,
it is more efficient to declare a fixed-length object with its length equal to the number of pieces of information first.
In R, we can do this via:
vector("storage_type", length)
So instead of
$number_of_children <- list() analysis
we should do
$number_of_children <- vector("list", length=6) analysis
<- list()
analysis $number_of_children <- vector("list", length=6)
analysis
$number_of_children[[1]] <-length(dataSet[[1]]$children)
analysis$number_of_children[[2]] <-length(dataSet[[2]]$children)
analysis$number_of_children[[3]] <-length(dataSet[[3]]$children)
analysis$number_of_children[[4]] <-length(dataSet[[4]]$children)
analysis$number_of_children[[5]] <-length(dataSet[[5]]$children)
analysis$number_of_children[[6]] <-length(dataSet[[6]]$children) analysis
5.1.3 For loop
When for each… statement appears, you will see similar task repeated several times, the only difference is the position index. In computer science, for loop is a way to express repeating a task several times.
What task? (Task pattern)
How to repeat? (Iterate generation)
In our earlier task, there was a pattern repeated 6 times. If we replaced position 1, 2, …, 6 with .x
, all we see will be:
$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children) analysis
Then the task pattern is:
$number_of_children[[.x]] <-length(dataSet[[.x]]$children) analysis
For convenience of our discussion, let’s call it .x-pattern, where .x
is called iterate.
We need computer to generate iterate from 1 to 6, and repeat .x-pattern each time with .x replaced with the iterate generated. We can use
for(.x in c(1, 2, 3, 4, 5, 6)){.x-pattern}
to do the job. This is called for loop.
for(.x in c(1, 2, 3, 4, 5, 6)){
# .x-pattern
$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis }
The complete for loop includes the container declaration:
<- list()
analysis $number_of_children <- vector("list", length=6)
analysisfor(.x in c(1, 2, 3, 4, 5, 6)){
# .x-pattern
$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis
}
class(analysis$number_of_children) # a list
Since c(1, 2, 3, 4, 5, 6)
is based on the length of dataSet
, we can express iterate generation as:
in 1:length(dataSet) .x
A better way is to use:
in seq_along(dataSet) .x
seq_along(.)
does almost the same job as1:length(.)
except whendataSet
length is zero onlyseq_along(.)
would generate correct iterate (i.e. no iteration).
The best version now would be:
<- list()
analysis $number_of_children <- vector("list", length=6)
analysisfor(.x in seq_along(dataSet)){
# .x-pattern
$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis
}
class(analysis$number_of_children) # a list
The container creator:
vector("storage_type", length)
is to create container based on type – not on class. If you need your information to be certain class, you can always do parsing later – after you collect all the information.
<- c('1999-03-16','1996-07-14','1998-11-06','1999-07-16','1995-09-30','1997-09-08','1998-08-08','1999-03-26','1999-02-13','1997-07-04')
dataSet1
# declare a fixed length container based on information type before parsing
<- vector("character", length(dataSet1))
birthdays for(.x in seq_along(dataSet1)){
<- dataSet1[[.x]]
birthdays[[.x]]
}# parse your collected information afterward.
|> lubridate::ymd() -> birthdays birthdays
Exercise 5.1 Load the following data set:
# Observation by observation
<-
concerts_obo ::fromJSON("https://cloud.culture.tw/frontsite/trans/SearchShowAction.do?method=doFindTypeJ&category=17", simplifyDataFrame = F) jsonlite
Data set concerts_obo
has difference performance information as its observations. For each performance, there can be more than one show, and each show’s information is contained inside the observation’s showInfo
element.
How many shows does the 5th performance have?
If we are to find out the number of shows for each performance, what is the .x-pattern?
Find out the number of shows for each performance, and put the answer as an integer vector under
analysis$number_of_shows
. (i.e.analysis$number_of_shows[[5]]
will show your answer in 1.)Construct a data frame with columns, title, date, location and locationName, where title comes from each observation’s title, and the rest three are from each observation’s 1st showInfo. In addition, date column has to be “POSIXct” “POSIXt” class with Taipei time zone.
=NA
flag
if(as.logical(flag)){print("T")}else{print("F")}
5.1.4 For each group
<-
dataSet1 data.frame(
team=c('team 2','team 3','team 1','team 2','team 3','team 1','team 2','team 1','team 3','team 2','team 3','team 1','team 3','team 1','team 2'),
entranceYear = c(100,100,101,100,100,100,101,100,100,101,101,100,101,100,100),
grade=c(98,95,76,87,67,83,84,94,70,64,99,77,89,62,65)
)
For each team, compute its mean grade.
Solving the task requires first:
- Form grouped data set: split your data set into different grouped data sets (i.e. the data set whose observations share certain feature value) based on some feature (i.e. team here).
Then
- For each grouped data set, compute its mean grade.
Form grouped data set
split(dataSet1, dataSet1$team) -> grouped_dataset1
View(grouped_dataset1)
split(x, f)
wheref
is supposed to be a factor–sodataSet1$team
will be converted to factor behind your back.
Dealing with each grouped data set
For each grouped data set, compute its mean grade.
<- vector("double", length(grouped_dataset1))
group_meanGrades
1]] <- mean(grouped_dataset1[[1]]$grade)
group_meanGrades[[2]] <- mean(grouped_dataset1[[2]]$grade)
group_meanGrades[[3]] <- mean(grouped_dataset1[[3]]$grade) group_meanGrades[[
Use for loop:
<- vector("double", length(grouped_dataset1))
group_meanGrades for(.x in seq_along(grouped_dataset1)){
<- mean(grouped_dataset1[[.x]]$grade)
group_meanGrades[[.x]] }
Put all together. To solve:
For each team, compute its mean grade.
# Form grouped data sets
split(dataSet1, dataSet1$team) -> grouped_dataset1
# For each data sets, solve the required problem
<- vector("double", length(grouped_dataset1))
group_meanGrades for(.x in seq_along(grouped_dataset1)){
<- mean(grouped_dataset1[[.x]]$grade)
group_meanGrades[[.x]] }
Exercise 5.2 Regarding the drug
data from Exercise 4.8.
For each kind of drug (
毒品品項
), what was the average amount of drug (the mean of數量(淨重)_克
) found?Add a column
date
to the data frame which is a date class created from the column發生日期
For the following question, we only consider those 毒品品項
that contains the words “安非他命”.
- For each year, what was the total amount found? (Apply
lubridate::year()
todate
column will get you year)
Multiple grouping factors
Sometimes we want to split data set based on more than one factors. In this case we need to create the interaction of these factors using :
operator.
Split dataSet1
based on team
and entranceYear
features.
$team |> factor() -> dataSet1$team
dataSet1$entranceYear |> factor() -> dataSet1$entranceYear
dataSet1$team:dataSet1$entranceYear -> .interaction
dataSet1print(.interaction)
|>
dataSet1 split(.interaction) -> grouped_dataSets2
View(grouped_dataSets2)
5.2 Systematic analysis
It’s common for data analysis to follow certain procedures in its data summarisation stage.
<-
dataSet3 data.frame(
dates = c("2016-11-15", NA, NA, "1997-05-07", "1995-08-25", "2002-09-20", NA, NA, NA, "1995-07-16", "2011-06-22"),
grades = c(29, 53, 26, 27, 55, 69, NA, NA, 63, NA, 56),
genders = c("Male", "Female", "Male", "Male", "Female", "Female", NA, "Male", "Male", "Female", "Female"),
majors = c("economics", "economics", NA, "economics", "economics", "economics", "economics", "statistics", "law", "economics", "law")
)
# class check
<- vector("integer", length(dataSet3))
class_check for(.x in seq_along(dataSet3)){
<- class(dataSet3[[.x]])
class_check[[.x]]
}names(class_check) <- names(dataSet3)
class_check
# NA check
<- vector("integer", length(dataSet3))
na_check for(.x in seq_along(dataSet3)){
<- sum(is.na(dataSet3[[.x]]))
na_check[[.x]]
}names(na_check) <- names(dataSet3)
na_check
The above two are very common to be applied to most data sets. It would be nice if instead of a long procedure:
# class check
<- vector("integer", length(dataSet3))
class_check for(.x in seq_along(dataSet3)){
<- class(dataSet3[[.x]])
class_check[[.x]]
}names(class_check) <- names(dataSet3)
class_check
# NA check
<- vector("integer", length(dataSet3))
na_check for(.x in seq_along(dataSet3)){
<- sum(is.na(dataSet3[[.x]]))
na_check[[.x]]
}names(na_check) <- names(dataSet3)
na_check
we can do something like:
<- check_class(dataSet3)
class_check <- check_na(dataSet3) na_check
The above lines need three things to happen:
class_check <-
andna_check <-
need values to bind with, socheck_class(dataSet3)
andcheck_na(dataSet3)
must return a value. (function return value)Both
check_na
andcheck_class
functions must be searchable in your current working space. (check the Environment tab.) (function search path)check_class(dataSet3)
andcheck_na(dataSet3)
mean both functions are called forward to work with the value ofdataSet3
. (Computer science jargon is function call.) They must have sufficient objects to complete their attached programming block. (execution environment)
5.2.1 Function as reusable programming block
Function in general is a portable/reusable programming block.
Standard programming blocks:
# checking blocks for dataSet3
<- {
class_check # class check
<- vector("integer", length(dataSet3))
class_check for(.x in seq_along(dataSet3)){
<- class(dataSet3[[.x]])
class_check[[.x]]
}names(class_check) <- names(dataSet3)
# the last executed line
class_check
}
<- {
na_check # NA check
<- vector("integer", length(dataSet3))
na_check for(.x in seq_along(dataSet3)){
<- sum(is.na(dataSet3[[.x]]))
na_check[[.x]]
}names(na_check) <- names(dataSet3)
# the last executed line
na_check }
- The names to bind the visible value of the last executed line (i.e.
class_check <- {...}
andna_check <- {...}
) can be any name besideclass_check
andna_check
.
If we have another data set needs to do those two checks:
= data.frame(
dataSet4 x = c(6,10,10,6,10,6,NA,NA,3,7),
y = c(2,3,6,6,8,2,8,6,8,3)
)
Unless the new data set is named dataSet3
, you cannot re-run the previous check chunk.
- You have to replace
dataSet3
withdataSet4
if you simply copy/paste the checking programing blocks:
= data.frame(...)
dataSet3 # checking blocks for dataSet3
:
= data.frame(...)
dataSet4 <- dataSet4
dataSet3 # checking blocks for dataSet3
We can make both programming blocks portable by creating functions:
<- function(){
check_class # class check
<- vector("integer", length(dataSet3))
class_check for(.x in seq_along(dataSet3)){
<- class(dataSet3[[.x]])
class_check[[.x]]
}names(class_check) <- names(dataSet3)
# the last executed line
class_check
}
<- function(){
check_na # NA check
<- vector("integer", length(dataSet3))
na_check for(.x in seq_along(dataSet3)){
<- sum(is.na(dataSet3[[.x]]))
na_check[[.x]]
}names(na_check) <- names(dataSet3)
# the last executed line
na_check }
<- function(){...}
will bind the programming block with the name you specified. Here arecheck_class
andcheck_na
.The last executed line will return a value if it is a name call or
return(name)
:
<- function(){
check_class
...return(class_check) # the last executed line returned
}<- function(){
check_na
...return(na_check) # the last executed line returned
}
Whenever we need to re-use the programming blocks, we just do a function call.
# function definition
<- function(){...}
check_class <- function(){...}
check_na
# a long program
= data.frame(...)
dataSet3 ## need check here
<- check_class()
class_check1 <- check_na()
na_check1 ## other part of the program
= data.frame(...)
dataSet4 = dataSet4
dataSet3 ## need check here
<- check_class()
class_check2 <- check_na() na_check2
It is equivalent to:
# a long program
= data.frame(...)
dataSet3 ## need check here
<- {
class_check1 # class check
<- vector("integer", length(dataSet3))
class_check for(.x in seq_along(dataSet3)){
<- class(dataSet3[[.x]])
class_check[[.x]]
}names(class_check) <- names(dataSet3)
# the last executed line
class_check
}
<- {
na_check1 # NA check
<- vector("integer", length(dataSet3))
na_check for(.x in seq_along(dataSet3)){
<- sum(is.na(dataSet3[[.x]]))
na_check[[.x]]
}names(na_check) <- names(dataSet3)
# the last executed line
na_check
}## other part of the program
= data.frame(...)
dataSet4 = dataSet4
dataSet3 ## need check here
<- {
class_check2 # class check
<- vector("integer", length(dataSet3))
class_check for(.x in seq_along(dataSet3)){
<- class(dataSet3[[.x]])
class_check[[.x]]
}names(class_check) <- names(dataSet3)
# the last executed line
class_check
}
<- {
na_check2 # NA check
<- vector("integer", length(dataSet3))
na_check for(.x in seq_along(dataSet3)){
<- sum(is.na(dataSet3[[.x]]))
na_check[[.x]]
}names(na_check) <- names(dataSet3)
# the last executed line
na_check }
Several tips to improve your programming efficiency.
Whenever your programming lines are about solving a specific task, use programming blocks. Later if you want to reuse the programming block, you will know where to locate it and make it into a function.
Your function name should start with a verb since it is about doing something.
Exercise 5.3 It is very common that economic model has some equilibrium condition as the intersection of two straight lines, like demand and supply. They can be expressed as: \[\mbox{demand: }p_d(q)=a+bq\\ \mbox{supply: }p_s(q)=c+dq \] In equilibrium, \[(p*, q*)\mbox{ where }p_d(q*)=p_s(q*)\mbox{ and }p*=p_d(q*)=p_s(q*).\]
If you solve for \(q*\) and \(p*\), \[q*=(a-c)/(d-b)\\ p*=a+bq*.\]
- Run the following code and solve for equilibrium and save it in a list with two element names
q_star
andp_star
. Each has its value in corresponding to \(q*\) and \(p*\) that your program solved.
=1; b=-3; c=0; d=1 a
- Create a function called solve_equilibrium so that the following code would work:
=1; b=-3; c=0; d=1
a<- solve_equilibrium()
equilibrium1 print(equilibrium1)
=2; b=-3; c=0; d=1
a<- solve_equilibrium()
equilibrium2 print(equilibrium2)
=1; b=-3; c=0; d=1.5
a<- solve_equilibrium()
equilibrium3 print(equilibrium3)
5.2.2 Function on search path
When we need to run the following codes:
# first check_class check_na function call
<- data.frame(...)
dataSet3 check_class()
check_na()
check_class
andcheck_na
have to be on the search path in your R session, which normally starts with Global Environment, followed by different packages.RStudio: Environment window, Global Environment dropdown.
check_class
and check_na
are function names. check_class()
and check_na()
that uses the programming block that function carries are called function call.
To place your function in the global environment, you can do one of the two approaches.
Approach 1.
Define in the main script before their function call
<- function(){
check_class # class check
<- vector("integer", length(dataSet3))
class_check for(.x in seq_along(dataSet3)){
<- class(dataSet3[[.x]])
class_check[[.x]]
}names(class_check) <- names(dataSet3)
# the last executed line
class_check
}<- function(){...}
check_na
# first check_class check_na function call
<- data.frame(...)
dataSet3 check_class()
check_na()
Even though the programming block (aka function body) requires an object named
dataSet3
in it. Those objects inside the function body does not have to exist at the time of definition.- R won’t check the existence of objects inside the programming block at the time of function creation. The existence will be checked only at the time of the function call. This property is called lazy evaluation in computer science.
5.2.2.1 Approach 2.
Because function definition does not need any object inside the programming block to exist, they can be saved in a totally different file and used later in a totally different environment.
Save all your function in a separate .R
script file, say myFunctions.R
, then source the script file before the first function call used.
source("myFunctions.R")
# first check_class check_na function call
<- data.frame(...)
dataSet3 check_class()
check_na()
- This approach is better for organizing your program, and keep your main script flow more logically fluent.
5.2.3 Environments
After functions are on your search path, it’s time to understand how a function call is processed. To have a clear understanding, let’s imagine a new R session where the global environment is empty. Then run the following lines:
= 1
x = 33
y
<- function(){
robot1 +y
xbrowser()
}
So the global environment is populated with two objects x
and y
and one robot robot1
. (Indeed, function is more like a robot in real life!)
Now do a function call
robot1()
When objects are imported from the global environment, it will make a copy in its execution environment, and run the function body inside it.
Because objects in execution environment imported from Global environment are copies. Function body that changes object values will affect only the copy object but not the source object in the Global environment.
<- function(){
change_x_1stValue2a 1] <- "a"
x[
}
=c("b", "b", "b")
xchange_x_1stValue2a() # the copy of x in execution environment is changed
# source x is unchanged x
If you do want to change x
value in the global environment, you need to do element value replacement, ie.
= new_x # where new_x must be a return from change_x_1stValue2a function x
Modify change_x_1stValue2a
function
<- function(){
change_x_1stValue2a 1] <- "a"
x[return(x)
}
=c("b", "b", "b")
x=change_x_1stValue2a()
x x
5.2.4 Input arguments
Sometimes you don’t want certain objects in function body to be copied and imported from Global environment, but to create a new one inside the execution environment. You can put those object name in function input arguments.
<- function(x){
robot2 +y
xbrowser()
}
Now do the function call
robot2(x=20)
Caller hands in a value for
robot2
to bind withx
in the execution environment. Or you can imagine caller tellsrobot2
whatx
is, thenrobot2
creates objectx
directly inside the execution environment – not copy/import from the global environmentx
.Only y is copied/imported from the global environment.
Put object name as an input argument when creating the function can isolate the value of the same-name object in the global environment.
Earlier we have
= data.frame(...)
dataSet4 <- dataSet4
dataSet3 check_class()
If we put dataSet3
as check_class’s input argument, that is
<- function(dataSet3){
check_class # class check
<- vector("integer", length(dataSet3))
class_check for(.x in seq_along(dataSet3)){
<- class(dataSet3[[.x]])
class_check[[.x]]
}names(class_check) <- names(dataSet3)
# the last executed line
class_check }
Then the function called to work on dataSet4
can be more straight forward:
= data.frame(...)
dataSet4 check_class(dataSet3 = dataSet4)
The input specification of the function call will not affect
dataSet3
in the global environment. It simply tellscheck_class
robot how to create adataSet3
object in the execution environment directly.dataSet3
will not be copied/imported from the Global environment.
In practice, input argument are object we want to create directly inside the execution environment under caller’s specification. Usually we will name the input argument in a more generous way instead of dataSet3
, like:
<- function(data_set){
check_class # class check
<- vector("integer", length(data_set))
class_check for(.x in seq_along(data_set)){
<- class(data_set[[.x]])
class_check[[.x]]
}names(class_check) <- names(data_set)
# the last executed line
class_check }
In function definition,
Input arguments ensure that those argument objects value are assigned from the function call, rather than copied/imported from the global environment.
For those objects in function body, they must be either defined by the function call (i.e. input argument), borrowed from the global environment, or an intermediate step by product inside the body.
<- function(x){
myfun +y -> z
x**2+6
z }
- x will be defined in function call. y will be imported from the global environment, z is a byproduct of intermediate steps.
myfun(3)
=5
ymyfun(3)
5.2.5 Advanced Concept of Environments
There are three places regarding function that are important:
where it is lives,
where it is called for a task,
where it solves the task.
function environment
Place 1 is about where on R’s searching path we can find the function.
<- function(){
myFun +y
x }
- We can find
myFun
in the global environment. So the function environment ofmyFun
is the global environment.
caller environment
Place 2 is called caller environment. It is where the function call happens. The function’s caller environment will be attached to its function environment as part of search path for the function to have the job done
Think of a function as a plumber whose lives in Taichung (function environment), gets a job call from Taipei (caller environment). When he needs some tools to do his job, he will start from his familiar Taichung to find those tools. If not found, he will try to find in Taipei.
myFun(x=3)
The function call happens in the global environment as well.
myFun
plumber lives and gets a job in global environment.
execution environment
Place 3 is called execution environment. It is where the task content (aka function body) is done.
When a function does the job as the function call was specified, it creates a new environment to do the work – like an office. The environment is almost empty (no object to use), but it is attached to the function environment.
In many cases, at least for this course, those functions that we create live and get a job call in the same place, i.e. the Global environment. And the execution environment office is a place attached to the global environment. When the office needs some object for the job, but the object is not in the office, the function will go to global environment to look for it.
5.2.6 Extract function
Functions that import objects from Global environment can create future programming problem.
<- function() xx + 5
myfun
# forget myfun needs to import xx
myfun()
# there is a xx totally for different purpose
=c("a", "b")
xxmyfun()
A good principle is unless you know what you are doing, otherwise never import objects from the global environment in function execution.
- Always put objects the body needs (except the by product) in function arguments.
<- function(dataSet3) {
check_class # class check
<- vector("integer", length(dataSet3))
class_check for(.x in seq_along(dataSet3)){
<- class(dataSet3[[.x]])
class_check[[.x]]
}names(class_check) <- names(dataSet3)
# the last executed line
class_check
}
<- function(dataSet3){
check_na # NA check
<- vector("integer", length(dataSet3))
na_check for(.x in seq_along(dataSet3)){
<- sum(is.na(dataSet3[[.x]]))
na_check[[.x]]
}names(na_check) <- names(dataSet3)
# the last executed line
na_check }
RStudio, select the programming block, then Code > Extract Function to create a function whose execution never import any object value from the global environment.
- Remember to check the input arguments it picks up. A lot of time it will pick up package name, for loop iterates, etc. as well, which should not be input arguments
= list(c("a b", "cd "), c(" a", " b"))
list_string for(.x in seq_along(list_string)){
::str_remove_all(list_string[[.x]], '\\s') ->
stringr
list_string[[.x]]
} list_string
- it will pick up arguments
(list_string, .x, stringr, str_remove_all)
5.2.7 Function usage and default
5.2.7.1 Argument name in function call
<- function(x, y, z){ x+y+z} myfun2
myfun2(x=1, y=2, z=3)
myfun2(1, 2, 3) # ignore argument names if the order is correct
myfun2(z=3, x=1, y=2) # keep argument names can avoid wrong order problem
5.2.7.2 default
= c("a, b", "a, c, b")
string
{::str_split(string, ", ") -> string_splited
stringrtable(unlist(string_splited))
}
<- function(string){
tabulate_multipleSelections ::str_split(string, ", ") -> string_splited
stringrtable(unlist(string_splited))
}c("a, b", "a, c, b") |> tabulate_multipleSelections()
= c(
hobby 'sport, reading, movie',
'sport',
'movie, sport, reading',
'movie, Reading',
'sport')
|> tabulate_multipleSelections() hobby
Argument with default
Suppose sometimes you bumped into “;” separation data. But most of the time, it is “,”.
= c(
hobby2 'sport; reading; movie',
'sport',
'movie; sport; reading',
'movie; Reading',
'sport')
<- function(string, sep=","){
tabulate_multipleSelections2 = paste(sep, '')
pattern ::str_split(string, pattern) -> string_splited
stringrtable(unlist(string_splited))
}
|> tabulate_multipleSelections2()
hobby |> tabulate_multipleSelections2(sep=";") # or
hobby2 # tabulate_multipleSelections2(hobby2, sep=";")
Exercise 5.4 Import the wdi
data from 4.8 Exercise-5 and obtain iso2c_nonCountry
from Exercise 4.19
- The following code remove any non country entry in the data.
<- wdi$data
data_set <- c('ZH','ZI','1A','S3','B8','V2','Z4','4E','T4','XC','Z7','7E','T7','EU','F1','XE','XD','XF','ZT','XH','XI','XG','V3','ZJ','XJ','T2','XL','XO','XM','XN','ZQ','XQ','T3','XP','XU','XY','OE','S4','S2','V4','V1','S1','8S','T5','ZG','ZF','T6','XT','1W')
iso2c_nonCountry <- !(data_set$iso2c %in% iso2c_nonCountry)
pick_countries data_set[pick_countries, ]
Use it to create a function remove_nonCountries
so that any data frame, say df_example
with iso2c
in it can do the function call as the following to remove those non-country entries.
<- remove_nonCountries(data_set=df_example) df_example
- The following code remove non countries from the data set and narrow down further to year 2020 data. Then summarise the indicator’s mean, median, and range
$data |> remove_nonCountries() -> data_set
wdi
|> subset(year==2020) -> data_set2020 # it is the same as
data_set = "SG.GEN.PARL.ZS"
code
{|> range(na.rm=T) -> output_range
data_set2020[[code]] |> mean(na.rm=T) -> output_mean
data_set2020[[code]] |> median(na.rm=T) -> output_median
data_set2020[[code]] list(
mean=output_mean,
median=output_median,
range=list(output_range)
|> list2DF()
) }
Construct a function summarise_numerical
which can be used to produce a summary data frame of mean, median, and range for any given data set (as input argument data_set
) and a numerical feature column name (as input argument feature
). In other words, with the help of summarise_numerical
function, the above code chunk can be replace with
$data |> remove_nonCountries() -> data_set
wdi
|> subset(year==2020) -> data_set2020 # it is the same as
data_set = "SG.GEN.PARL.ZS"
code summarise_numerical(data_set=data_set2020, feature=code)
- Gender inequality is an important issue in social science. One possible indicator to compare this inequality across countries is:
- Proportion of seats held by women in national parliaments (%) (code name is “SG.GEN.PARL.ZS”).
What is the year range in the data set? For each year compute the mean of this indicator across countries. Is the trend of mean increasing over time?
- Create a function
get_meanTrendOverYears
when do the following function call, it will return a vector of the mean of all countries’ given code feature value over the years, with years as element names. (That is if mean is 2, 3, 8 for year 2010, 2011, 2012, then the returned vector should be the named numeric vectorc("2010"=2, "2011"=3, "2012"=8)
.)
get_meanTrendOverYears(data_set=data_set, code="SG.GEN.PARL.ZS")
5.3 Task by situation
Consider the following programming flow:
A school organized welcoming talks for students of either Economics or Sociology major in different rooms. For Economics majors, it is room 201. For Sociology majors, it is room 301. You want to print:
Please go to room 201. (for Economics majors)
Please go to room 301. (for Sociology majors)
An incoming student’s major is
<- sample(c("Economics", "Sociology"), 1) oneComingStudentMajor
What should you print?
# oneComingStudentMajor == "Economics"
print("Please go to room 201.")
# oneComingStudentMajor == "Sociology"
print("Please go to room 301.")
5.3.1 if
In computer programming, we can create a conditional execution for each possible action using:
if(oneComingStudentMajor == "Economics"){
print("Please go to room 201.")
}
if(oneComingStudentMajor == "Sociology"){
print("Please go to room 301.")
}
For each conditional execution,
if(...){....}
{...}
will be executed only if the(...)
part is A TRUE value.
The (...)
part is called a control flag. If the flag is up (i.e. TRUE), then the {...}
will be released to be executed. If the flag is down (i.e. FALSE), then the {...}
will be ignored (not executed).
Properties of a flag:
- A flag is to control whether a programming block should be released to the programming flow.
# T flag will always release the block for execution
if(T){print("TRUE block")}
# F flag will never release the block for execution
if(F){print("FALSE block")}
- A flag should only have ONE TRUE or FALSE. If flag is a vector, only the first one is used.
# T flag will always release the block for execution
if(c(T,T)){print("TT")}
if(c(T,F)){print("TF")}
if(c(F,F)){print("FF")}
if(c(F,T)){print("FT")}
# Use any() / all() if appropriate
if(c("a", "b")=="a"){print("case 1")}
if(any(c("a", "b")=="a")){print("case 2")}
if(all(c("a", "b")=="a")){print("case 3")}
- Flag should NOT HAVE
NA
.
# use `isTRUE` to ensure the block is relased only when a legitimate flag value of TRUE happens
if(isTRUE(T)){print("case 1")}
if(isTRUE("a")){print("case 2")}
if(isTRUE(c(T,T))){print("case 3")}
if(isTRUE(NA)){print("case 4")}
# there is also `isFALSE`.
Set length warning into an error:
Sys.setenv("_R_CHECK_LENGTH_1_CONDITION_" = "true")
Rock-paper-scissor
::showPrompt('', message="Input r for rock, p for paper, or s for scissor") -> userChoice
rstudioapi
<- sample(c("r", "p", "s"), 1)
computerChoice
print(glue::glue("You chose {userChoice}, computer chose {computerChoice}"))
# If user won
print("You won.")
# If computer won,
print("Computer won.")
# If tie,
print("There is a tie.")
We need to design three flags for each execution:
if(flag_userwin){print("You won.")}
if(flag_computerwin){print("Computer won.")}
if(flag_tie){print("There is a tie.")}
A possible way to define these flags is:
<- paste0(userChoice, computerChoice)
combination
<- combination %in% c("rs", "sp", "pr")
flag_userwin <- combination %in% c("sr", "ps", "pr")
flag_computerwin <- combination %in% c("rr", "ss", "pp") flag_tie
A complete program is:
::showPrompt('', message="Input r for rock, p for paper, or s for scissor") -> userChoice
rstudioapi
<- sample(c("r", "p", "s"), 1)
computerChoice
print(glue::glue("You chose {userChoice}, computer chose {computerChoice}"))
<- paste0(userChoice, computerChoice)
combination
### three control flows
# r > s > p > r
if(combination %in% c("rs", "sp", "pr")){print("You won.")}
if(combination %in% c("sr", "ps", "rp")){print("Computer won.")}
if(combination %in% c("rr", "ss", "pp")){print("There is a tie.")}
For the following discussion, we will repetitively use the part before the control flow. For the ease of learning, let us define a function:
<- function(){
get_rock_paper_scissor_combination ::showPrompt('', message="Input r for rock, p for paper, or s for scissor") -> userChoice
rstudioapi
<- sample(c("r", "p", "s"), 1)
computerChoice
print(glue::glue("You chose {userChoice}, computer chose {computerChoice}"))
<- c("user"=userChoice, "computer"=computerChoice)
outcome
return(outcome)
}
5.3.2 else if
If only one of the flags can be TRUE, a more efficient control flow would be using else
to connect all the if conditional blocks.
<- get_rock_paper_scissor_combination()
outcome <- paste(outcome, collapse="")
combination
### one control flow
# r > s > p > r
if(combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
if(combination %in% c("rr", "ss", "pp")){print("There is a tie.")}
The
else
connected conditional executions is considered as one control flow.One control flow release at most one block to execute: the first TRUE flag will trigger its attached programming block’s execution. After that, the remaining flags and the their attached blocks will be ignored.
5.3.3 else
If all the flags exhaust all possible situations, the last if(...)
can be ignored.
<- get_rock_paper_scissor_combination()
outcome <- paste(outcome, collapse="")
combination # r > s > p > r
if(combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
print("There is a tie.")} {
Three types of if
control flow:
if(flag){...}
: release one block at most.
if(flag1){block1}else if(flag2){block2}...else if(flagN){blockN}
: release one block at most–the first TRUE block will be released.
if(flag1){block1}else if(flag2){block2}...else if(flagM){blockM}else{blockN}
: release definitely one block – either the first TRUE block, or blockN.
Any situation that does not make preceding flags true will be considered a TRUE for the last block. Weird thing can happen if you do not design flags properly. Try input anything other than “r”, “s”, “p”.
5.3.4 &&
and ||
<- get_rock_paper_scissor_combination()
outcome <- paste(outcome, collapse="")
combination
<- outcome[["user"]] %in% c("r", "s", "p")
flag_valid # r > s > p > r
if(flag_valid & combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(flag_valid & combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
if(flag_valid){print("There is a tie.")} else
print("Please input only r, s or p.")} {
Though the above control flow works, it is not efficient. For &
, computer will check all conditions. But in this case, if flag_valid is already FALSE
, there is no need to check combination
. Therefore, in control flow, it is better to use &&
to join multiple flag conditions.
... && ... && ... && ...
returnsFALSE
when it encounters the first FALSE and stop checking the conditions after that.
<- get_rock_paper_scissor_combination()
outcome <- paste(outcome, collapse="")
combination
<- outcome[["user"]] %in% c("r", "s", "p")
flag_valid # r > s > p > r
if(flag_valid && combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(flag_valid && combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
if(flag_valid){print("There is a tie.")} else
print("Please input only r, s or p.")} {
The same situation also happens is for |
. You should always use ||
for or-joining multiple flag conditions.
... || ... || ... || ...
returnsTRUE
when it encounters the first TRUE and stop checking the conditions after that.
There is also an efficient OR operator, ||
. It is efficient in the sense that when a TRUE in the front is returned, no further condition check is needed, the result will be TRUE.
TRUE || x # will return TRUE even x is unknown.
TRUE | x # will return ERROR
FALSE && x # will return FALSE even x is unknown.
FALSE & x # will return ERROR
When construct flags, you should only use &&
and ||
instead of &
and |
. The latter are for vector comparison.
Since control flow will only release the 1st TRUE block, any preceding blocks not released mean their flags are all FALSE.
if(flag1){block1} else
if(flag2){block2} else
{block3}
is the same as
if(flag1){block1} else
if(!flag1 && flag2){block2} else
if(!flag1 && !flag2){block3}
Another control flow design:
<- get_rock_paper_scissor_combination()
outcome <- paste(outcome, collapse="")
combination
<- outcome[["user"]] %in% c("r", "s", "p")
flag_valid # r > s > p > r
if(!flag_valid){
print("Please input only r, s or p.")
else
} if(combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
print("There is a tie.")} {
5.4 An example
Given the data set:
<-
dataSet1 data.frame(
cat = c('a','a','b','a','c','a','b','a','b','a','b','a','a','b','a'),
num=c(75,81,68,81,68,85,58,55,70,83,78,83,67,87,70))
In preliminary data summary stage, for categorical data (mainly means categorical data), we want to summarise its count number in each category, its highest count category (i.e. mode). For numerical data, we want to compute its mean and range, etc.
The summarisation procedures depends on what class the data vector is.
if data vector is categorical, tabulate and find mode.
if data vector is numeric, compute mean and range.
5.4.0.1 Design control flow
# summarise a data vector, say data_vector
# (1) construct flags
<- is.character(data_vector) || is.factor(data_vector) || is.logical(data_vector)
flag_categorical <- is.numeric(data_vector)
flag_numerical
# (2) Assemble control flow
if(flag_categorical){
summarise_categorical(data_vector)
else if(flag_numerical){
} summarise_numerical(data_vector)
}
- The paths here are expressed as a function call since we can always make a programming block portable by defining a function. Then the execution of the programming block will be a function call.
Path of summarise categorical:
{= table(data_vector)
output_table = output_table[(output_table == max(output_table))]
output_mode list(
table=output_table,
mode=output_mode
) }
Path of summarise numerical:
{= mean(data_vector, na.rm=T)
output_mean = range(data_vector, na.rm=T)
output_range list(
mean=output_mean,
range=output_range
) }
Complete control flow:
# summarise a data vector, say data_vector
# (1) construct flags
<- is.character(data_vector) || is.factor(data_vector) || is.logical(data_vector)
flag_categorical <- is.numeric(data_vector)
flag_numerical
# (2) Assemble control flow
if(flag_categorical){
= table(data_vector)
output_table = output_table[(output_table == max(output_table))]
output_mode list(
table=output_table,
mode=output_mode
)else if(flag_numerical){
} = mean(data_vector, na.rm=T)
output_mean = range(data_vector, na.rm=T)
output_range list(
mean=output_mean,
range=output_range
) }
Or
Path of summarise categorical:
<- function(data_vector){
summarise_categorical = table(data_vector)
output_table = output_table[(output_table == max(output_table))]
output_mode list(
table=output_table,
mode=output_mode
)
}
<- function(data_vector){
summarise_numerical = mean(data_vector, na.rm=T)
output_mean = range(data_vector, na.rm=T)
output_range list(
mean=output_mean,
range=output_range
) }
OR
# summarise a data vector, say data_vector
# (0) load summarise path function (can also put in a source .R script then source it first)
<- function(data_vector){
summarise_categorical = table(data_vector)
output_table = output_table[(output_table == max(output_table))]
output_mode list(
table=output_table,
mode=output_mode
)
}
<- function(data_vector){
summarise_numerical = mean(data_vector, na.rm=T)
output_mean = range(data_vector, na.rm=T)
output_range list(
mean=output_mean,
range=output_range
)
}
# (1) construct flags
<- is.character(data_vector) || is.factor(data_vector) || is.logical(data_vector)
flag_categorical <- is.numeric(data_vector)
flag_numerical
# (2) Assemble control flow
if(flag_categorical){
summarise_categorical(data_vector)
else if(flag_numerical){
} summarise_numerical(data_vector)
}
We can pack the above program into a big block and define a function to carry it.
# summarise a data vector, say data_vector
# (0) load summarise path function (can also put in a source .R script then source it first)
<- function(data_vector){
summarise_categorical = table(data_vector)
output_table = output_table[(output_table == max(output_table))]
output_mode list(
table=output_table,
mode=output_mode
)
}
<- function(data_vector){
summarise_numerical = mean(data_vector, na.rm=T)
output_mean = range(data_vector, na.rm=T)
output_range list(
mean=output_mean,
range=output_range
)
}
<- function(data_vector) {
summarise_data_vector # (1) construct flags
<- is.character(data_vector) || is.factor(data_vector) || is.logical(data_vector)
flag_categorical <- is.numeric(data_vector)
flag_numerical
# (2) Assemble control flow
if(flag_categorical){
summarise_categorical(data_vector)
else if(flag_numerical){
} summarise_numerical(data_vector)
} }
summarise_data_vector(dataSet1$cat)
summarise_data_vector(dataSet1$num)
An even better one is to add a function call summarise_dataFrame
:
<- function(dataSet1){
summarise_dataFrame <- vector("list", length(dataSet1))
.summary for(.x in seq_along(dataSet1)){
<- summarise_data_vector(dataSet1[[.x]])
.summary[[.x]]
}names(.summary) <- names(dataSet1) # element names as data set column names
return(.summary)
}
summarise_dataFrame(dataSet1)