Chapter 5 Programming Design
5.1 For each
5.1.1 For each observation
dataSet <- list(
list(
gender="male", height=178, age=32,
children=list(
list(gender="male", age=5),
list(gender="male", age=0))
),
list(
gender="female", height=166, age=30,
children=list(
list(gender="female", age=3))
),
list(
gender="male", height=170, age=42,
children=list(
list(gender="male", age=10),
list(gender="female", age=8))
),
list(
gender="male", height=188, age=22,
children=list()
),
list(
gender="male", height=172, age=52,
children=list(
list(gender="female", age=25),
list(gender="female", age=23))
),
list(
gender="female", height=160, age=42,
children=list(
list(gender="female", age=11))
)
)For each observation, how many children does each have?
analysis <- list()
analysis$number_of_children <- list()
analysis$number_of_children[[1]] <-length(dataSet[[1]]$children)
analysis$number_of_children[[2]] <-length(dataSet[[2]]$children)
analysis$number_of_children[[3]] <-length(dataSet[[3]]$children)
analysis$number_of_children[[4]] <-length(dataSet[[4]]$children)
analysis$number_of_children[[5]] <-length(dataSet[[5]]$children)
analysis$number_of_children[[6]] <-length(dataSet[[6]]$children)5.1.2 Information container
analysis$number_of_children is an information container to contain all the pieces of information that you want. Here, we use declare-then-add method to save our task information. When declaring an object, only the storage type is determined, but not its length. In computer science,
- if the number of pieces of information (not each piece’s value) is already known,
it is more efficient to declare a fixed-length object with its length equal to the number of pieces of information first.
In R, we can do this via:
vector("storage_type", length)So instead of
analysis$number_of_children <- list()we should do
analysis$number_of_children <- vector("list", length=6)analysis <- list()
analysis$number_of_children <- vector("list", length=6)
analysis$number_of_children[[1]] <-length(dataSet[[1]]$children)
analysis$number_of_children[[2]] <-length(dataSet[[2]]$children)
analysis$number_of_children[[3]] <-length(dataSet[[3]]$children)
analysis$number_of_children[[4]] <-length(dataSet[[4]]$children)
analysis$number_of_children[[5]] <-length(dataSet[[5]]$children)
analysis$number_of_children[[6]] <-length(dataSet[[6]]$children)5.1.3 For loop
When for each… statement appears, you will see similar task repeated several times, the only difference is the position index. In computer science, for loop is a way to express repeating a task several times.
What task? (Task pattern)
How to repeat? (Iterate generation)
In our earlier task, there was a pattern repeated 6 times. If we replaced position 1, 2, …, 6 with .x, all we see will be:
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)Then the task pattern is:
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)For convenience of our discussion, let’s call it .x-pattern, where .x is called iterate.
We need computer to generate iterate from 1 to 6, and repeat .x-pattern each time with .x replaced with the iterate generated. We can use
for(.x in c(1, 2, 3, 4, 5, 6)){.x-pattern}
to do the job. This is called for loop.
for(.x in c(1, 2, 3, 4, 5, 6)){
# .x-pattern
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
}The complete for loop includes the container declaration:
analysis <- list()
analysis$number_of_children <- vector("list", length=6)
for(.x in c(1, 2, 3, 4, 5, 6)){
# .x-pattern
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
}
class(analysis$number_of_children) # a listSince c(1, 2, 3, 4, 5, 6) is based on the length of dataSet, we can express iterate generation as:
.x in 1:length(dataSet)A better way is to use:
.x in seq_along(dataSet)seq_along(.)does almost the same job as1:length(.)except whendataSetlength is zero onlyseq_along(.)would generate correct iterate (i.e. no iteration).
The best version now would be:
analysis <- list()
analysis$number_of_children <- vector("list", length=6)
for(.x in seq_along(dataSet)){
# .x-pattern
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
}
class(analysis$number_of_children) # a listThe container creator:
vector("storage_type", length)is to create container based on type – not on class. If you need your information to be certain class, you can always do parsing later – after you collect all the information.
dataSet1 <- c('1999-03-16','1996-07-14','1998-11-06','1999-07-16','1995-09-30','1997-09-08','1998-08-08','1999-03-26','1999-02-13','1997-07-04')
# declare a fixed length container based on information type before parsing
birthdays <- vector("character", length(dataSet1))
for(.x in seq_along(dataSet1)){
birthdays[[.x]] <- dataSet1[[.x]]
}
# parse your collected information afterward.
birthdays |> lubridate::ymd() -> birthdaysExercise 5.1 Load the following data set:
# Observation by observation
concerts_obo <-
jsonlite::fromJSON("https://cloud.culture.tw/frontsite/trans/SearchShowAction.do?method=doFindTypeJ&category=17", simplifyDataFrame = F)Data set concerts_obo has difference performance information as its observations. For each performance, there can be more than one show, and each show’s information is contained inside the observation’s showInfo element.
How many shows does the 5th performance have?
If we are to find out the number of shows for each performance, what is the .x-pattern?
Find out the number of shows for each performance, and put the answer as an integer vector under
analysis$number_of_shows. (i.e.analysis$number_of_shows[[5]]will show your answer in 1.)Construct a data frame with columns, title, date, location and locationName, where title comes from each observation’s title, and the rest three are from each observation’s 1st showInfo. In addition, date column has to be “POSIXct” “POSIXt” class with Taipei time zone.
flag=NA
if(as.logical(flag)){print("T")}else{print("F")}5.1.4 For each group
dataSet1 <-
data.frame(
team=c('team 2','team 3','team 1','team 2','team 3','team 1','team 2','team 1','team 3','team 2','team 3','team 1','team 3','team 1','team 2'),
entranceYear = c(100,100,101,100,100,100,101,100,100,101,101,100,101,100,100),
grade=c(98,95,76,87,67,83,84,94,70,64,99,77,89,62,65)
)For each team, compute its mean grade.
Solving the task requires first:
- Form grouped data set: split your data set into different grouped data sets (i.e. the data set whose observations share certain feature value) based on some feature (i.e. team here).
Then
- For each grouped data set, compute its mean grade.
Form grouped data set
split(dataSet1, dataSet1$team) -> grouped_dataset1
View(grouped_dataset1)split(x, f)wherefis supposed to be a factor–sodataSet1$teamwill be converted to factor behind your back.
Dealing with each grouped data set
For each grouped data set, compute its mean grade.
group_meanGrades <- vector("double", length(grouped_dataset1))
group_meanGrades[[1]] <- mean(grouped_dataset1[[1]]$grade)
group_meanGrades[[2]] <- mean(grouped_dataset1[[2]]$grade)
group_meanGrades[[3]] <- mean(grouped_dataset1[[3]]$grade)Use for loop:
group_meanGrades <- vector("double", length(grouped_dataset1))
for(.x in seq_along(grouped_dataset1)){
group_meanGrades[[.x]] <- mean(grouped_dataset1[[.x]]$grade)
}Put all together. To solve:
For each team, compute its mean grade.
# Form grouped data sets
split(dataSet1, dataSet1$team) -> grouped_dataset1
# For each data sets, solve the required problem
group_meanGrades <- vector("double", length(grouped_dataset1))
for(.x in seq_along(grouped_dataset1)){
group_meanGrades[[.x]] <- mean(grouped_dataset1[[.x]]$grade)
}Exercise 5.2 Regarding the drug data from Exercise 4.8.
For each kind of drug (
毒品品項), what was the average amount of drug (the mean of數量(淨重)_克) found?Add a column
dateto the data frame which is a date class created from the column發生日期
For the following question, we only consider those 毒品品項 that contains the words “安非他命”.
- For each year, what was the total amount found? (Apply
lubridate::year()todatecolumn will get you year)
Multiple grouping factors
Sometimes we want to split data set based on more than one factors. In this case we need to create the interaction of these factors using : operator.
Split dataSet1 based on team and entranceYear features.
dataSet1$team |> factor() -> dataSet1$team
dataSet1$entranceYear |> factor() -> dataSet1$entranceYear
dataSet1$team:dataSet1$entranceYear -> .interaction
print(.interaction)
dataSet1 |>
split(.interaction) -> grouped_dataSets2
View(grouped_dataSets2)5.2 Systematic analysis
It’s common for data analysis to follow certain procedures in its data summarisation stage.
dataSet3 <-
data.frame(
dates = c("2016-11-15", NA, NA, "1997-05-07", "1995-08-25", "2002-09-20", NA, NA, NA, "1995-07-16", "2011-06-22"),
grades = c(29, 53, 26, 27, 55, 69, NA, NA, 63, NA, 56),
genders = c("Male", "Female", "Male", "Male", "Female", "Female", NA, "Male", "Male", "Female", "Female"),
majors = c("economics", "economics", NA, "economics", "economics", "economics", "economics", "statistics", "law", "economics", "law")
)# class check
class_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
class_check[[.x]] <- class(dataSet3[[.x]])
}
names(class_check) <- names(dataSet3)
class_check# NA check
na_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
}
names(na_check) <- names(dataSet3)
na_checkThe above two are very common to be applied to most data sets. It would be nice if instead of a long procedure:
# class check
class_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
class_check[[.x]] <- class(dataSet3[[.x]])
}
names(class_check) <- names(dataSet3)
class_check
# NA check
na_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
}
names(na_check) <- names(dataSet3)
na_checkwe can do something like:
class_check <- check_class(dataSet3)
na_check <- check_na(dataSet3)The above lines need three things to happen:
class_check <-andna_check <-need values to bind with, socheck_class(dataSet3)andcheck_na(dataSet3)must return a value. (function return value)Both
check_naandcheck_classfunctions must be searchable in your current working space. (check the Environment tab.) (function search path)check_class(dataSet3)andcheck_na(dataSet3)mean both functions are called forward to work with the value ofdataSet3. (Computer science jargon is function call.) They must have sufficient objects to complete their attached programming block. (execution environment)
5.2.1 Function as reusable programming block
Function in general is a portable/reusable programming block.
Standard programming blocks:
# checking blocks for dataSet3
class_check <- {
# class check
class_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
class_check[[.x]] <- class(dataSet3[[.x]])
}
names(class_check) <- names(dataSet3)
class_check # the last executed line
}
na_check <- {
# NA check
na_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
}
names(na_check) <- names(dataSet3)
na_check # the last executed line
}- The names to bind the visible value of the last executed line (i.e.
class_check <- {...}andna_check <- {...}) can be any name besideclass_checkandna_check.
If we have another data set needs to do those two checks:
dataSet4 = data.frame(
x = c(6,10,10,6,10,6,NA,NA,3,7),
y = c(2,3,6,6,8,2,8,6,8,3)
)Unless the new data set is named dataSet3, you cannot re-run the previous check chunk.
- You have to replace
dataSet3withdataSet4if you simply copy/paste the checking programing blocks:
dataSet3 = data.frame(...)
# checking blocks for dataSet3
:
dataSet4 = data.frame(...)
dataSet3 <- dataSet4
# checking blocks for dataSet3We can make both programming blocks portable by creating functions:
check_class <- function(){
# class check
class_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
class_check[[.x]] <- class(dataSet3[[.x]])
}
names(class_check) <- names(dataSet3)
class_check # the last executed line
}
check_na <- function(){
# NA check
na_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
}
names(na_check) <- names(dataSet3)
na_check # the last executed line
}<- function(){...}will bind the programming block with the name you specified. Here arecheck_classandcheck_na.The last executed line will return a value if it is a name call or
return(name):
check_class <- function(){
...
return(class_check) # the last executed line returned
}
check_na <- function(){
...
return(na_check) # the last executed line returned
}Whenever we need to re-use the programming blocks, we just do a function call.
# function definition
check_class <- function(){...}
check_na <- function(){...}
# a long program
dataSet3 = data.frame(...)
## need check here
class_check1 <- check_class()
na_check1 <- check_na()
## other part of the program
dataSet4 = data.frame(...)
dataSet3 = dataSet4
## need check here
class_check2 <- check_class()
na_check2 <- check_na()It is equivalent to:
# a long program
dataSet3 = data.frame(...)
## need check here
class_check1 <- {
# class check
class_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
class_check[[.x]] <- class(dataSet3[[.x]])
}
names(class_check) <- names(dataSet3)
class_check # the last executed line
}
na_check1 <- {
# NA check
na_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
}
names(na_check) <- names(dataSet3)
na_check # the last executed line
}
## other part of the program
dataSet4 = data.frame(...)
dataSet3 = dataSet4
## need check here
class_check2 <- {
# class check
class_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
class_check[[.x]] <- class(dataSet3[[.x]])
}
names(class_check) <- names(dataSet3)
class_check # the last executed line
}
na_check2 <- {
# NA check
na_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
}
names(na_check) <- names(dataSet3)
na_check # the last executed line
}Several tips to improve your programming efficiency.
Whenever your programming lines are about solving a specific task, use programming blocks. Later if you want to reuse the programming block, you will know where to locate it and make it into a function.
Your function name should start with a verb since it is about doing something.
Exercise 5.3 It is very common that economic model has some equilibrium condition as the intersection of two straight lines, like demand and supply. They can be expressed as: \[\mbox{demand: }p_d(q)=a+bq\\ \mbox{supply: }p_s(q)=c+dq \] In equilibrium, \[(p*, q*)\mbox{ where }p_d(q*)=p_s(q*)\mbox{ and }p*=p_d(q*)=p_s(q*).\]
If you solve for \(q*\) and \(p*\), \[q*=(a-c)/(d-b)\\ p*=a+bq*.\]
- Run the following code and solve for equilibrium and save it in a list with two element names
q_starandp_star. Each has its value in corresponding to \(q*\) and \(p*\) that your program solved.
a=1; b=-3; c=0; d=1- Create a function called solve_equilibrium so that the following code would work:
a=1; b=-3; c=0; d=1
equilibrium1 <- solve_equilibrium()
print(equilibrium1)
a=2; b=-3; c=0; d=1
equilibrium2 <- solve_equilibrium()
print(equilibrium2)
a=1; b=-3; c=0; d=1.5
equilibrium3 <- solve_equilibrium()
print(equilibrium3)5.2.2 Function on search path
When we need to run the following codes:
# first check_class check_na function call
dataSet3 <- data.frame(...)
check_class()
check_na()check_classandcheck_nahave to be on the search path in your R session, which normally starts with Global Environment, followed by different packages.RStudio: Environment window, Global Environment dropdown.
check_class and check_na are function names. check_class() and check_na() that uses the programming block that function carries are called function call.
To place your function in the global environment, you can do one of the two approaches.
Approach 1.
Define in the main script before their function call
check_class <- function(){
# class check
class_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
class_check[[.x]] <- class(dataSet3[[.x]])
}
names(class_check) <- names(dataSet3)
class_check # the last executed line
}
check_na <- function(){...}
# first check_class check_na function call
dataSet3 <- data.frame(...)
check_class()
check_na()Even though the programming block (aka function body) requires an object named
dataSet3in it. Those objects inside the function body does not have to exist at the time of definition.- R won’t check the existence of objects inside the programming block at the time of function creation. The existence will be checked only at the time of the function call. This property is called lazy evaluation in computer science.
5.2.2.1 Approach 2.
Because function definition does not need any object inside the programming block to exist, they can be saved in a totally different file and used later in a totally different environment.
Save all your function in a separate .R script file, say myFunctions.R, then source the script file before the first function call used.
source("myFunctions.R")
# first check_class check_na function call
dataSet3 <- data.frame(...)
check_class()
check_na()- This approach is better for organizing your program, and keep your main script flow more logically fluent.
5.2.3 Environments
After functions are on your search path, it’s time to understand how a function call is processed. To have a clear understanding, let’s imagine a new R session where the global environment is empty. Then run the following lines:
x = 1
y = 33
robot1 <- function(){
x+y
browser()
}So the global environment is populated with two objects x and y and one robot robot1. (Indeed, function is more like a robot in real life!)
Now do a function call
robot1()When objects are imported from the global environment, it will make a copy in its execution environment, and run the function body inside it.
Because objects in execution environment imported from Global environment are copies. Function body that changes object values will affect only the copy object but not the source object in the Global environment.
change_x_1stValue2a <- function(){
x[1] <- "a"
}
x=c("b", "b", "b")
change_x_1stValue2a() # the copy of x in execution environment is changed
x # source x is unchangedIf you do want to change x value in the global environment, you need to do element value replacement, ie.
x = new_x # where new_x must be a return from change_x_1stValue2a functionModify change_x_1stValue2a function
change_x_1stValue2a <- function(){
x[1] <- "a"
return(x)
}
x=c("b", "b", "b")
x=change_x_1stValue2a()
x5.2.4 Input arguments
Sometimes you don’t want certain objects in function body to be copied and imported from Global environment, but to create a new one inside the execution environment. You can put those object name in function input arguments.
robot2 <- function(x){
x+y
browser()
}Now do the function call
robot2(x=20)Caller hands in a value for
robot2to bind withxin the execution environment. Or you can imagine caller tellsrobot2whatxis, thenrobot2creates objectxdirectly inside the execution environment – not copy/import from the global environmentx.Only y is copied/imported from the global environment.
Put object name as an input argument when creating the function can isolate the value of the same-name object in the global environment.
Earlier we have
dataSet4 = data.frame(...)
dataSet3 <- dataSet4
check_class()If we put dataSet3 as check_class’s input argument, that is
check_class <- function(dataSet3){
# class check
class_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
class_check[[.x]] <- class(dataSet3[[.x]])
}
names(class_check) <- names(dataSet3)
class_check # the last executed line
}Then the function called to work on dataSet4 can be more straight forward:
dataSet4 = data.frame(...)
check_class(dataSet3 = dataSet4)The input specification of the function call will not affect
dataSet3in the global environment. It simply tellscheck_classrobot how to create adataSet3object in the execution environment directly.dataSet3will not be copied/imported from the Global environment.
In practice, input argument are object we want to create directly inside the execution environment under caller’s specification. Usually we will name the input argument in a more generous way instead of dataSet3, like:
check_class <- function(data_set){
# class check
class_check <- vector("integer", length(data_set))
for(.x in seq_along(data_set)){
class_check[[.x]] <- class(data_set[[.x]])
}
names(class_check) <- names(data_set)
class_check # the last executed line
}In function definition,
Input arguments ensure that those argument objects value are assigned from the function call, rather than copied/imported from the global environment.
For those objects in function body, they must be either defined by the function call (i.e. input argument), borrowed from the global environment, or an intermediate step by product inside the body.
myfun <- function(x){
x+y -> z
z**2+6
}- x will be defined in function call. y will be imported from the global environment, z is a byproduct of intermediate steps.
myfun(3)y=5
myfun(3)5.2.5 Advanced Concept of Environments
There are three places regarding function that are important:
where it is lives,
where it is called for a task,
where it solves the task.
function environment
Place 1 is about where on R’s searching path we can find the function.
myFun <- function(){
x+y
}- We can find
myFunin the global environment. So the function environment ofmyFunis the global environment.
caller environment
Place 2 is called caller environment. It is where the function call happens. The function’s caller environment will be attached to its function environment as part of search path for the function to have the job done
Think of a function as a plumber whose lives in Taichung (function environment), gets a job call from Taipei (caller environment). When he needs some tools to do his job, he will start from his familiar Taichung to find those tools. If not found, he will try to find in Taipei.
myFun(x=3)The function call happens in the global environment as well.
myFunplumber lives and gets a job in global environment.
execution environment
Place 3 is called execution environment. It is where the task content (aka function body) is done.
When a function does the job as the function call was specified, it creates a new environment to do the work – like an office. The environment is almost empty (no object to use), but it is attached to the function environment.
In many cases, at least for this course, those functions that we create live and get a job call in the same place, i.e. the Global environment. And the execution environment office is a place attached to the global environment. When the office needs some object for the job, but the object is not in the office, the function will go to global environment to look for it.
5.2.6 Extract function
Functions that import objects from Global environment can create future programming problem.
myfun <- function() xx + 5
# forget myfun needs to import xx
myfun()
# there is a xx totally for different purpose
xx=c("a", "b")
myfun()A good principle is unless you know what you are doing, otherwise never import objects from the global environment in function execution.
- Always put objects the body needs (except the by product) in function arguments.
check_class <- function(dataSet3) {
# class check
class_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
class_check[[.x]] <- class(dataSet3[[.x]])
}
names(class_check) <- names(dataSet3)
class_check # the last executed line
}
check_na <- function(dataSet3){
# NA check
na_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
}
names(na_check) <- names(dataSet3)
na_check # the last executed line
}RStudio, select the programming block, then Code > Extract Function to create a function whose execution never import any object value from the global environment.
- Remember to check the input arguments it picks up. A lot of time it will pick up package name, for loop iterates, etc. as well, which should not be input arguments
list_string = list(c("a b", "cd "), c(" a", " b"))
for(.x in seq_along(list_string)){
stringr::str_remove_all(list_string[[.x]], '\\s') ->
list_string[[.x]]
}
list_string- it will pick up arguments
(list_string, .x, stringr, str_remove_all)
5.2.7 Function usage and default
5.2.7.1 Argument name in function call
myfun2 <- function(x, y, z){ x+y+z}myfun2(x=1, y=2, z=3)
myfun2(1, 2, 3) # ignore argument names if the order is correct
myfun2(z=3, x=1, y=2) # keep argument names can avoid wrong order problem5.2.7.2 default
string = c("a, b", "a, c, b")
{
stringr::str_split(string, ", ") -> string_splited
table(unlist(string_splited))
}tabulate_multipleSelections <- function(string){
stringr::str_split(string, ", ") -> string_splited
table(unlist(string_splited))
}
c("a, b", "a, c, b") |> tabulate_multipleSelections()
hobby = c(
'sport, reading, movie',
'sport',
'movie, sport, reading',
'movie, Reading',
'sport')
hobby |> tabulate_multipleSelections()Argument with default
Suppose sometimes you bumped into “;” separation data. But most of the time, it is “,”.
hobby2 = c(
'sport; reading; movie',
'sport',
'movie; sport; reading',
'movie; Reading',
'sport')tabulate_multipleSelections2 <- function(string, sep=","){
pattern = paste(sep, '')
stringr::str_split(string, pattern) -> string_splited
table(unlist(string_splited))
}hobby |> tabulate_multipleSelections2()
hobby2 |> tabulate_multipleSelections2(sep=";") # or
# tabulate_multipleSelections2(hobby2, sep=";")Exercise 5.4 Import the wdi data from 4.8 Exercise-5 and obtain iso2c_nonCountry from Exercise 4.19
- The following code remove any non country entry in the data.
data_set <- wdi$data
iso2c_nonCountry <- c('ZH','ZI','1A','S3','B8','V2','Z4','4E','T4','XC','Z7','7E','T7','EU','F1','XE','XD','XF','ZT','XH','XI','XG','V3','ZJ','XJ','T2','XL','XO','XM','XN','ZQ','XQ','T3','XP','XU','XY','OE','S4','S2','V4','V1','S1','8S','T5','ZG','ZF','T6','XT','1W')
pick_countries <- !(data_set$iso2c %in% iso2c_nonCountry)
data_set[pick_countries, ]Use it to create a function remove_nonCountries so that any data frame, say df_example with iso2c in it can do the function call as the following to remove those non-country entries.
df_example <- remove_nonCountries(data_set=df_example)- The following code remove non countries from the data set and narrow down further to year 2020 data. Then summarise the indicator’s mean, median, and range
wdi$data |> remove_nonCountries() -> data_set
data_set |> subset(year==2020) -> data_set2020 # it is the same as
code = "SG.GEN.PARL.ZS"
{
data_set2020[[code]] |> range(na.rm=T) -> output_range
data_set2020[[code]] |> mean(na.rm=T) -> output_mean
data_set2020[[code]] |> median(na.rm=T) -> output_median
list(
mean=output_mean,
median=output_median,
range=list(output_range)
) |> list2DF()
}Construct a function summarise_numerical which can be used to produce a summary data frame of mean, median, and range for any given data set (as input argument data_set) and a numerical feature column name (as input argument feature). In other words, with the help of summarise_numerical function, the above code chunk can be replace with
wdi$data |> remove_nonCountries() -> data_set
data_set |> subset(year==2020) -> data_set2020 # it is the same as
code = "SG.GEN.PARL.ZS"
summarise_numerical(data_set=data_set2020, feature=code)- Gender inequality is an important issue in social science. One possible indicator to compare this inequality across countries is:
- Proportion of seats held by women in national parliaments (%) (code name is “SG.GEN.PARL.ZS”).
What is the year range in the data set? For each year compute the mean of this indicator across countries. Is the trend of mean increasing over time?
- Create a function
get_meanTrendOverYearswhen do the following function call, it will return a vector of the mean of all countries’ given code feature value over the years, with years as element names. (That is if mean is 2, 3, 8 for year 2010, 2011, 2012, then the returned vector should be the named numeric vectorc("2010"=2, "2011"=3, "2012"=8).)
get_meanTrendOverYears(data_set=data_set, code="SG.GEN.PARL.ZS")5.3 Task by situation
Consider the following programming flow:
A school organized welcoming talks for students of either Economics or Sociology major in different rooms. For Economics majors, it is room 201. For Sociology majors, it is room 301. You want to print:
Please go to room 201. (for Economics majors)
Please go to room 301. (for Sociology majors)
An incoming student’s major is
oneComingStudentMajor <- sample(c("Economics", "Sociology"), 1)What should you print?
# oneComingStudentMajor == "Economics"
print("Please go to room 201.")
# oneComingStudentMajor == "Sociology"
print("Please go to room 301.")5.3.1 if
In computer programming, we can create a conditional execution for each possible action using:
if(oneComingStudentMajor == "Economics"){
print("Please go to room 201.")
}
if(oneComingStudentMajor == "Sociology"){
print("Please go to room 301.")
}For each conditional execution,
if(...){....}{...}will be executed only if the(...)part is A TRUE value.
The (...) part is called a control flag. If the flag is up (i.e. TRUE), then the {...} will be released to be executed. If the flag is down (i.e. FALSE), then the {...} will be ignored (not executed).
Properties of a flag:
- A flag is to control whether a programming block should be released to the programming flow.
# T flag will always release the block for execution
if(T){print("TRUE block")}
# F flag will never release the block for execution
if(F){print("FALSE block")}- A flag should only have ONE TRUE or FALSE. If flag is a vector, only the first one is used.
# T flag will always release the block for execution
if(c(T,T)){print("TT")}
if(c(T,F)){print("TF")}
if(c(F,F)){print("FF")}
if(c(F,T)){print("FT")}# Use any() / all() if appropriate
if(c("a", "b")=="a"){print("case 1")}
if(any(c("a", "b")=="a")){print("case 2")}
if(all(c("a", "b")=="a")){print("case 3")}- Flag should NOT HAVE
NA.
# use `isTRUE` to ensure the block is relased only when a legitimate flag value of TRUE happens
if(isTRUE(T)){print("case 1")}
if(isTRUE("a")){print("case 2")}
if(isTRUE(c(T,T))){print("case 3")}
if(isTRUE(NA)){print("case 4")}
# there is also `isFALSE`.Set length warning into an error:
Sys.setenv("_R_CHECK_LENGTH_1_CONDITION_" = "true")Rock-paper-scissor
rstudioapi::showPrompt('', message="Input r for rock, p for paper, or s for scissor") -> userChoice
computerChoice <- sample(c("r", "p", "s"), 1)
print(glue::glue("You chose {userChoice}, computer chose {computerChoice}"))
# If user won
print("You won.")
# If computer won,
print("Computer won.")
# If tie,
print("There is a tie.")We need to design three flags for each execution:
if(flag_userwin){print("You won.")}
if(flag_computerwin){print("Computer won.")}
if(flag_tie){print("There is a tie.")}A possible way to define these flags is:
combination <- paste0(userChoice, computerChoice)
flag_userwin <- combination %in% c("rs", "sp", "pr")
flag_computerwin <- combination %in% c("sr", "ps", "pr")
flag_tie <- combination %in% c("rr", "ss", "pp")A complete program is:
rstudioapi::showPrompt('', message="Input r for rock, p for paper, or s for scissor") -> userChoice
computerChoice <- sample(c("r", "p", "s"), 1)
print(glue::glue("You chose {userChoice}, computer chose {computerChoice}"))
combination <- paste0(userChoice, computerChoice)
### three control flows
# r > s > p > r
if(combination %in% c("rs", "sp", "pr")){print("You won.")}
if(combination %in% c("sr", "ps", "rp")){print("Computer won.")}
if(combination %in% c("rr", "ss", "pp")){print("There is a tie.")}For the following discussion, we will repetitively use the part before the control flow. For the ease of learning, let us define a function:
get_rock_paper_scissor_combination <- function(){
rstudioapi::showPrompt('', message="Input r for rock, p for paper, or s for scissor") -> userChoice
computerChoice <- sample(c("r", "p", "s"), 1)
print(glue::glue("You chose {userChoice}, computer chose {computerChoice}"))
outcome <- c("user"=userChoice, "computer"=computerChoice)
return(outcome)
}5.3.2 else if
If only one of the flags can be TRUE, a more efficient control flow would be using else to connect all the if conditional blocks.
outcome <- get_rock_paper_scissor_combination()
combination <- paste(outcome, collapse="")
### one control flow
# r > s > p > r
if(combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
if(combination %in% c("rr", "ss", "pp")){print("There is a tie.")}The
elseconnected conditional executions is considered as one control flow.One control flow release at most one block to execute: the first TRUE flag will trigger its attached programming block’s execution. After that, the remaining flags and the their attached blocks will be ignored.
5.3.3 else
If all the flags exhaust all possible situations, the last if(...) can be ignored.
outcome <- get_rock_paper_scissor_combination()
combination <- paste(outcome, collapse="")
# r > s > p > r
if(combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
{print("There is a tie.")}Three types of if control flow:
if(flag){...}: release one block at most.
if(flag1){block1}else if(flag2){block2}...else if(flagN){blockN}: release one block at most–the first TRUE block will be released.
if(flag1){block1}else if(flag2){block2}...else if(flagM){blockM}else{blockN}: release definitely one block – either the first TRUE block, or blockN.
Any situation that does not make preceding flags true will be considered a TRUE for the last block. Weird thing can happen if you do not design flags properly. Try input anything other than “r”, “s”, “p”.
5.3.4 && and ||
outcome <- get_rock_paper_scissor_combination()
combination <- paste(outcome, collapse="")
flag_valid <- outcome[["user"]] %in% c("r", "s", "p")
# r > s > p > r
if(flag_valid & combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(flag_valid & combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
if(flag_valid){print("There is a tie.")} else
{print("Please input only r, s or p.")}Though the above control flow works, it is not efficient. For &, computer will check all conditions. But in this case, if flag_valid is already FALSE, there is no need to check combination. Therefore, in control flow, it is better to use && to join multiple flag conditions.
... && ... && ... && ...returnsFALSEwhen it encounters the first FALSE and stop checking the conditions after that.
outcome <- get_rock_paper_scissor_combination()
combination <- paste(outcome, collapse="")
flag_valid <- outcome[["user"]] %in% c("r", "s", "p")
# r > s > p > r
if(flag_valid && combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(flag_valid && combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
if(flag_valid){print("There is a tie.")} else
{print("Please input only r, s or p.")}The same situation also happens is for |. You should always use || for or-joining multiple flag conditions.
... || ... || ... || ...returnsTRUEwhen it encounters the first TRUE and stop checking the conditions after that.
There is also an efficient OR operator, ||. It is efficient in the sense that when a TRUE in the front is returned, no further condition check is needed, the result will be TRUE.
TRUE || x # will return TRUE even x is unknown.
TRUE | x # will return ERRORFALSE && x # will return FALSE even x is unknown.
FALSE & x # will return ERRORWhen construct flags, you should only use && and || instead of & and |. The latter are for vector comparison.
Since control flow will only release the 1st TRUE block, any preceding blocks not released mean their flags are all FALSE.
if(flag1){block1} else
if(flag2){block2} else
{block3} is the same as
if(flag1){block1} else
if(!flag1 && flag2){block2} else
if(!flag1 && !flag2){block3}Another control flow design:
outcome <- get_rock_paper_scissor_combination()
combination <- paste(outcome, collapse="")
flag_valid <- outcome[["user"]] %in% c("r", "s", "p")
# r > s > p > r
if(!flag_valid){
print("Please input only r, s or p.")
} else
if(combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
{print("There is a tie.")}5.4 An example
Given the data set:
dataSet1 <-
data.frame(
cat = c('a','a','b','a','c','a','b','a','b','a','b','a','a','b','a'),
num=c(75,81,68,81,68,85,58,55,70,83,78,83,67,87,70))In preliminary data summary stage, for categorical data (mainly means categorical data), we want to summarise its count number in each category, its highest count category (i.e. mode). For numerical data, we want to compute its mean and range, etc.
The summarisation procedures depends on what class the data vector is.
if data vector is categorical, tabulate and find mode.
if data vector is numeric, compute mean and range.
5.4.0.1 Design control flow
# summarise a data vector, say data_vector
# (1) construct flags
flag_categorical <- is.character(data_vector) || is.factor(data_vector) || is.logical(data_vector)
flag_numerical <- is.numeric(data_vector)
# (2) Assemble control flow
if(flag_categorical){
summarise_categorical(data_vector)
} else if(flag_numerical){
summarise_numerical(data_vector)
}- The paths here are expressed as a function call since we can always make a programming block portable by defining a function. Then the execution of the programming block will be a function call.
Path of summarise categorical:
{
output_table = table(data_vector)
output_mode = output_table[(output_table == max(output_table))]
list(
table=output_table,
mode=output_mode
)
}Path of summarise numerical:
{
output_mean = mean(data_vector, na.rm=T)
output_range = range(data_vector, na.rm=T)
list(
mean=output_mean,
range=output_range
)
}Complete control flow:
# summarise a data vector, say data_vector
# (1) construct flags
flag_categorical <- is.character(data_vector) || is.factor(data_vector) || is.logical(data_vector)
flag_numerical <- is.numeric(data_vector)
# (2) Assemble control flow
if(flag_categorical){
output_table = table(data_vector)
output_mode = output_table[(output_table == max(output_table))]
list(
table=output_table,
mode=output_mode
)
} else if(flag_numerical){
output_mean = mean(data_vector, na.rm=T)
output_range = range(data_vector, na.rm=T)
list(
mean=output_mean,
range=output_range
)
}Or
Path of summarise categorical:
summarise_categorical <- function(data_vector){
output_table = table(data_vector)
output_mode = output_table[(output_table == max(output_table))]
list(
table=output_table,
mode=output_mode
)
}
summarise_numerical <- function(data_vector){
output_mean = mean(data_vector, na.rm=T)
output_range = range(data_vector, na.rm=T)
list(
mean=output_mean,
range=output_range
)
}OR
# summarise a data vector, say data_vector
# (0) load summarise path function (can also put in a source .R script then source it first)
summarise_categorical <- function(data_vector){
output_table = table(data_vector)
output_mode = output_table[(output_table == max(output_table))]
list(
table=output_table,
mode=output_mode
)
}
summarise_numerical <- function(data_vector){
output_mean = mean(data_vector, na.rm=T)
output_range = range(data_vector, na.rm=T)
list(
mean=output_mean,
range=output_range
)
}
# (1) construct flags
flag_categorical <- is.character(data_vector) || is.factor(data_vector) || is.logical(data_vector)
flag_numerical <- is.numeric(data_vector)
# (2) Assemble control flow
if(flag_categorical){
summarise_categorical(data_vector)
} else if(flag_numerical){
summarise_numerical(data_vector)
}We can pack the above program into a big block and define a function to carry it.
# summarise a data vector, say data_vector
# (0) load summarise path function (can also put in a source .R script then source it first)
summarise_categorical <- function(data_vector){
output_table = table(data_vector)
output_mode = output_table[(output_table == max(output_table))]
list(
table=output_table,
mode=output_mode
)
}
summarise_numerical <- function(data_vector){
output_mean = mean(data_vector, na.rm=T)
output_range = range(data_vector, na.rm=T)
list(
mean=output_mean,
range=output_range
)
}
summarise_data_vector <- function(data_vector) {
# (1) construct flags
flag_categorical <- is.character(data_vector) || is.factor(data_vector) || is.logical(data_vector)
flag_numerical <- is.numeric(data_vector)
# (2) Assemble control flow
if(flag_categorical){
summarise_categorical(data_vector)
} else if(flag_numerical){
summarise_numerical(data_vector)
}
}summarise_data_vector(dataSet1$cat)
summarise_data_vector(dataSet1$num)An even better one is to add a function call summarise_dataFrame:
summarise_dataFrame <- function(dataSet1){
.summary <- vector("list", length(dataSet1))
for(.x in seq_along(dataSet1)){
.summary[[.x]] <- summarise_data_vector(dataSet1[[.x]])
}
names(.summary) <- names(dataSet1) # element names as data set column names
return(.summary)
}summarise_dataFrame(dataSet1)