Chapter 5 Programming Design

5.1 For each

5.1.1 For each observation

dataSet <- list(
  list(
    gender="male", height=178, age=32,
    children=list(
      list(gender="male", age=5),
      list(gender="male", age=0))
  ),
  list(
    gender="female", height=166, age=30,
    children=list(
      list(gender="female", age=3))
  ),
  list(
    gender="male", height=170, age=42,
    children=list(
      list(gender="male", age=10),
      list(gender="female", age=8))
  ),
  list(
    gender="male", height=188, age=22,
    children=list()
  ),  
  list(
    gender="male", height=172, age=52,
    children=list(
      list(gender="female", age=25),
      list(gender="female", age=23))
  ),
  list(
    gender="female", height=160, age=42,
    children=list(
      list(gender="female", age=11))
  )
)

For each observation, how many children does each have?

analysis <- list()
analysis$number_of_children <- list()

analysis$number_of_children[[1]] <-length(dataSet[[1]]$children)
analysis$number_of_children[[2]] <-length(dataSet[[2]]$children)
analysis$number_of_children[[3]] <-length(dataSet[[3]]$children)
analysis$number_of_children[[4]] <-length(dataSet[[4]]$children)
analysis$number_of_children[[5]] <-length(dataSet[[5]]$children)
analysis$number_of_children[[6]] <-length(dataSet[[6]]$children)

5.1.2 Information container

analysis$number_of_children is an information container to contain all the pieces of information that you want. Here, we use declare-then-add method to save our task information. When declaring an object, only the storage type is determined, but not its length. In computer science,

  • if the number of pieces of information (not each piece’s value) is already known,

it is more efficient to declare a fixed-length object with its length equal to the number of pieces of information first.

In R, we can do this via:

vector("storage_type", length)

So instead of

analysis$number_of_children <- list()

we should do

analysis$number_of_children <- vector("list", length=6)

analysis <- list()
analysis$number_of_children <- vector("list", length=6)

analysis$number_of_children[[1]] <-length(dataSet[[1]]$children)
analysis$number_of_children[[2]] <-length(dataSet[[2]]$children)
analysis$number_of_children[[3]] <-length(dataSet[[3]]$children)
analysis$number_of_children[[4]] <-length(dataSet[[4]]$children)
analysis$number_of_children[[5]] <-length(dataSet[[5]]$children)
analysis$number_of_children[[6]] <-length(dataSet[[6]]$children)

5.1.3 For loop

When for each… statement appears, you will see similar task repeated several times, the only difference is the position index. In computer science, for loop is a way to express repeating a task several times.

  • What task? (Task pattern)

  • How to repeat? (Iterate generation)

In our earlier task, there was a pattern repeated 6 times. If we replaced position 1, 2, …, 6 with .x, all we see will be:

analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)

Then the task pattern is:

analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)

For convenience of our discussion, let’s call it .x-pattern, where .x is called iterate.


We need computer to generate iterate from 1 to 6, and repeat .x-pattern each time with .x replaced with the iterate generated. We can use

  • for(.x in c(1, 2, 3, 4, 5, 6)){.x-pattern}

to do the job. This is called for loop.

for(.x in c(1, 2, 3, 4, 5, 6)){
  # .x-pattern
  analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
}

The complete for loop includes the container declaration:

analysis <- list()
analysis$number_of_children <- vector("list", length=6)
for(.x in c(1, 2, 3, 4, 5, 6)){
  # .x-pattern
  analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
}

class(analysis$number_of_children) # a list

Since c(1, 2, 3, 4, 5, 6) is based on the length of dataSet, we can express iterate generation as:

.x in 1:length(dataSet)

A better way is to use:

.x in seq_along(dataSet)
  • seq_along(.) does almost the same job as 1:length(.) except when dataSet length is zero only seq_along(.) would generate correct iterate (i.e. no iteration).

The best version now would be:

analysis <- list()
analysis$number_of_children <- vector("list", length=6)
for(.x in seq_along(dataSet)){
  # .x-pattern
  analysis$number_of_children[[.x]] <-length(dataSet[[.x]]$children)
}

class(analysis$number_of_children) # a list

The container creator:

vector("storage_type", length)

is to create container based on type – not on class. If you need your information to be certain class, you can always do parsing later – after you collect all the information.

dataSet1 <- c('1999-03-16','1996-07-14','1998-11-06','1999-07-16','1995-09-30','1997-09-08','1998-08-08','1999-03-26','1999-02-13','1997-07-04')

# declare a fixed length container based on information type before parsing
birthdays <- vector("character", length(dataSet1))
for(.x in seq_along(dataSet1)){
  birthdays[[.x]] <- dataSet1[[.x]]
}
# parse your collected information afterward.
birthdays |> lubridate::ymd() -> birthdays


Exercise 5.1 Load the following data set:

# Observation by observation
concerts_obo <-
  jsonlite::fromJSON("https://cloud.culture.tw/frontsite/trans/SearchShowAction.do?method=doFindTypeJ&category=17", simplifyDataFrame = F)

Data set concerts_obo has difference performance information as its observations. For each performance, there can be more than one show, and each show’s information is contained inside the observation’s showInfo element.

  1. How many shows does the 5th performance have?

  2. If we are to find out the number of shows for each performance, what is the .x-pattern?

  3. Find out the number of shows for each performance, and put the answer as an integer vector under analysis$number_of_shows. (i.e. analysis$number_of_shows[[5]] will show your answer in 1.)

  4. Construct a data frame with columns, title, date, location and locationName, where title comes from each observation’s title, and the rest three are from each observation’s 1st showInfo. In addition, date column has to be “POSIXct” “POSIXt” class with Taipei time zone.

flag=NA

if(as.logical(flag)){print("T")}else{print("F")}

5.1.4 For each group

dataSet1 <- 
  data.frame(
    team=c('team 2','team 3','team 1','team 2','team 3','team 1','team 2','team 1','team 3','team 2','team 3','team 1','team 3','team 1','team 2'),
    entranceYear = c(100,100,101,100,100,100,101,100,100,101,101,100,101,100,100),
    grade=c(98,95,76,87,67,83,84,94,70,64,99,77,89,62,65)
  )

For each team, compute its mean grade.

Solving the task requires first:

  • Form grouped data set: split your data set into different grouped data sets (i.e. the data set whose observations share certain feature value) based on some feature (i.e. team here).

Then

  • For each grouped data set, compute its mean grade.

Form grouped data set

split(dataSet1, dataSet1$team) -> grouped_dataset1

View(grouped_dataset1)
  • split(x, f) where f is supposed to be a factor–so dataSet1$team will be converted to factor behind your back.

Dealing with each grouped data set

For each grouped data set, compute its mean grade.

group_meanGrades <- vector("double", length(grouped_dataset1))

group_meanGrades[[1]] <- mean(grouped_dataset1[[1]]$grade)
group_meanGrades[[2]] <- mean(grouped_dataset1[[2]]$grade)
group_meanGrades[[3]] <- mean(grouped_dataset1[[3]]$grade)

Use for loop:

group_meanGrades <- vector("double", length(grouped_dataset1))
for(.x in seq_along(grouped_dataset1)){
  group_meanGrades[[.x]] <- mean(grouped_dataset1[[.x]]$grade)
}

Put all together. To solve:

For each team, compute its mean grade.

# Form grouped data sets
split(dataSet1, dataSet1$team) -> grouped_dataset1

# For each data sets, solve the required problem
group_meanGrades <- vector("double", length(grouped_dataset1))
for(.x in seq_along(grouped_dataset1)){
  group_meanGrades[[.x]] <- mean(grouped_dataset1[[.x]]$grade)
}

Exercise 5.2 Regarding the drug data from Exercise 4.8.

  1. For each kind of drug (毒品品項), what was the average amount of drug (the mean of 數量(淨重)_克) found?

  2. Add a column date to the data frame which is a date class created from the column 發生日期

For the following question, we only consider those 毒品品項 that contains the words “安非他命”.

  1. For each year, what was the total amount found? (Apply lubridate::year() to date column will get you year)

Multiple grouping factors

Sometimes we want to split data set based on more than one factors. In this case we need to create the interaction of these factors using : operator.


Split dataSet1 based on team and entranceYear features.

dataSet1$team |> factor() -> dataSet1$team
dataSet1$entranceYear |> factor() -> dataSet1$entranceYear
dataSet1$team:dataSet1$entranceYear -> .interaction
print(.interaction)

dataSet1 |> 
  split(.interaction) -> grouped_dataSets2

View(grouped_dataSets2)

5.2 Systematic analysis

It’s common for data analysis to follow certain procedures in its data summarisation stage.

dataSet3 <-
  data.frame(
    dates = c("2016-11-15", NA, NA, "1997-05-07", "1995-08-25", "2002-09-20", NA, NA, NA, "1995-07-16", "2011-06-22"),
    grades = c(29, 53, 26, 27, 55, 69, NA, NA, 63, NA, 56),
    genders = c("Male", "Female", "Male", "Male", "Female", "Female", NA, "Male", "Male", "Female", "Female"),
    majors = c("economics", "economics", NA, "economics", "economics", "economics", "economics", "statistics", "law", "economics", "law")
  )
# class check
class_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
  class_check[[.x]] <- class(dataSet3[[.x]])
}
names(class_check) <- names(dataSet3)
class_check
# NA check
na_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
  na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
}
names(na_check) <- names(dataSet3)
na_check

The above two are very common to be applied to most data sets. It would be nice if instead of a long procedure:

# class check
class_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
  class_check[[.x]] <- class(dataSet3[[.x]])
}
names(class_check) <- names(dataSet3)
class_check

# NA check
na_check <- vector("integer", length(dataSet3))
for(.x in seq_along(dataSet3)){
  na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
}
names(na_check) <- names(dataSet3)
na_check

we can do something like:

class_check <- check_class(dataSet3)
na_check <- check_na(dataSet3)

The above lines need three things to happen:

  1. class_check <- and na_check <- need values to bind with, so check_class(dataSet3) and check_na(dataSet3) must return a value. (function return value)

  2. Both check_na and check_class functions must be searchable in your current working space. (check the Environment tab.) (function search path)

  3. check_class(dataSet3) and check_na(dataSet3) mean both functions are called forward to work with the value of dataSet3. (Computer science jargon is function call.) They must have sufficient objects to complete their attached programming block. (execution environment)

5.2.1 Function as reusable programming block

Function in general is a portable/reusable programming block.


Standard programming blocks:

# checking blocks for dataSet3
class_check <- {
  # class check
  class_check <- vector("integer", length(dataSet3))
  for(.x in seq_along(dataSet3)){
    class_check[[.x]] <- class(dataSet3[[.x]])
  }
  names(class_check) <- names(dataSet3)
  class_check # the last executed line
}


na_check <- {
  # NA check
  na_check <- vector("integer", length(dataSet3))
  for(.x in seq_along(dataSet3)){
    na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
  }
  names(na_check) <- names(dataSet3)
  na_check # the last executed line
}
  • The names to bind the visible value of the last executed line (i.e. class_check <- {...} and na_check <- {...}) can be any name beside class_check and na_check.

If we have another data set needs to do those two checks:

dataSet4 = data.frame(
  x = c(6,10,10,6,10,6,NA,NA,3,7),
  y = c(2,3,6,6,8,2,8,6,8,3)
)

Unless the new data set is named dataSet3, you cannot re-run the previous check chunk.

  • You have to replace dataSet3 with dataSet4 if you simply copy/paste the checking programing blocks:
dataSet3 = data.frame(...)
# checking blocks for dataSet3
:
  
dataSet4 = data.frame(...)
dataSet3 <- dataSet4
# checking blocks for dataSet3

We can make both programming blocks portable by creating functions:

check_class <- function(){
  # class check
  class_check <- vector("integer", length(dataSet3))
  for(.x in seq_along(dataSet3)){
    class_check[[.x]] <- class(dataSet3[[.x]])
  }
  names(class_check) <- names(dataSet3)
  class_check # the last executed line
}


check_na <- function(){
  # NA check
  na_check <- vector("integer", length(dataSet3))
  for(.x in seq_along(dataSet3)){
    na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
  }
  names(na_check) <- names(dataSet3)
  na_check # the last executed line
}
  • <- function(){...} will bind the programming block with the name you specified. Here are check_class and check_na.

  • The last executed line will return a value if it is a name call or return(name):

check_class <- function(){
  ...
  return(class_check) # the last executed line returned
}
check_na <- function(){
  ...
  return(na_check) # the last executed line returned
}

Whenever we need to re-use the programming blocks, we just do a function call.

# function definition
check_class <- function(){...}
check_na <- function(){...}

# a long program

dataSet3 = data.frame(...)
## need check here
class_check1 <- check_class()
na_check1 <- check_na()
## other part of the program

dataSet4 = data.frame(...)
dataSet3 = dataSet4
## need check here
class_check2 <- check_class()
na_check2 <- check_na()

It is equivalent to:

# a long program

dataSet3 = data.frame(...)
## need check here
class_check1 <- {
  # class check
  class_check <- vector("integer", length(dataSet3))
  for(.x in seq_along(dataSet3)){
    class_check[[.x]] <- class(dataSet3[[.x]])
  }
  names(class_check) <- names(dataSet3)
  class_check # the last executed line
}

na_check1 <- {
  # NA check
  na_check <- vector("integer", length(dataSet3))
  for(.x in seq_along(dataSet3)){
    na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
  }
  names(na_check) <- names(dataSet3)
  na_check # the last executed line
}
## other part of the program

dataSet4 = data.frame(...)
dataSet3 = dataSet4
## need check here
class_check2 <- {
  # class check
  class_check <- vector("integer", length(dataSet3))
  for(.x in seq_along(dataSet3)){
    class_check[[.x]] <- class(dataSet3[[.x]])
  }
  names(class_check) <- names(dataSet3)
  class_check # the last executed line
}

na_check2 <- {
  # NA check
  na_check <- vector("integer", length(dataSet3))
  for(.x in seq_along(dataSet3)){
    na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
  }
  names(na_check) <- names(dataSet3)
  na_check # the last executed line
}

Several tips to improve your programming efficiency.

  1. Whenever your programming lines are about solving a specific task, use programming blocks. Later if you want to reuse the programming block, you will know where to locate it and make it into a function.

  2. Your function name should start with a verb since it is about doing something.

Exercise 5.3 It is very common that economic model has some equilibrium condition as the intersection of two straight lines, like demand and supply. They can be expressed as: \[\mbox{demand: }p_d(q)=a+bq\\ \mbox{supply: }p_s(q)=c+dq \] In equilibrium, \[(p*, q*)\mbox{ where }p_d(q*)=p_s(q*)\mbox{ and }p*=p_d(q*)=p_s(q*).\]

If you solve for \(q*\) and \(p*\), \[q*=(a-c)/(d-b)\\ p*=a+bq*.\]

  1. Run the following code and solve for equilibrium and save it in a list with two element names q_star and p_star. Each has its value in corresponding to \(q*\) and \(p*\) that your program solved.
a=1; b=-3; c=0; d=1
  1. Create a function called solve_equilibrium so that the following code would work:
a=1; b=-3; c=0; d=1
equilibrium1 <- solve_equilibrium()
print(equilibrium1)

a=2; b=-3; c=0; d=1
equilibrium2 <- solve_equilibrium()
print(equilibrium2)

a=1; b=-3; c=0; d=1.5
equilibrium3 <- solve_equilibrium()
print(equilibrium3)

5.2.2 Function on search path

When we need to run the following codes:

# first check_class check_na function call
dataSet3 <- data.frame(...)
check_class()
check_na()
  • check_class and check_na have to be on the search path in your R session, which normally starts with Global Environment, followed by different packages.

  • RStudio: Environment window, Global Environment dropdown.

check_class and check_na are function names. check_class() and check_na() that uses the programming block that function carries are called function call.

To place your function in the global environment, you can do one of the two approaches.

Approach 1.

Define in the main script before their function call

check_class <- function(){
  # class check
  class_check <- vector("integer", length(dataSet3))
  for(.x in seq_along(dataSet3)){
    class_check[[.x]] <- class(dataSet3[[.x]])
  }
  names(class_check) <- names(dataSet3)
  class_check # the last executed line
}
check_na <- function(){...}

# first check_class check_na function call
dataSet3 <- data.frame(...)
check_class()
check_na()
  • Even though the programming block (aka function body) requires an object named dataSet3 in it. Those objects inside the function body does not have to exist at the time of definition.

    • R won’t check the existence of objects inside the programming block at the time of function creation. The existence will be checked only at the time of the function call. This property is called lazy evaluation in computer science.

5.2.2.1 Approach 2.

Because function definition does not need any object inside the programming block to exist, they can be saved in a totally different file and used later in a totally different environment.

Save all your function in a separate .R script file, say myFunctions.R, then source the script file before the first function call used.

source("myFunctions.R")

# first check_class check_na function call
dataSet3 <- data.frame(...)
check_class()
check_na()
  • This approach is better for organizing your program, and keep your main script flow more logically fluent.

5.2.3 Environments

After functions are on your search path, it’s time to understand how a function call is processed. To have a clear understanding, let’s imagine a new R session where the global environment is empty. Then run the following lines:

x = 1
y = 33

robot1 <- function(){
  x+y
  browser()
}

So the global environment is populated with two objects x and y and one robot robot1. (Indeed, function is more like a robot in real life!)

Now do a function call

robot1()

When objects are imported from the global environment, it will make a copy in its execution environment, and run the function body inside it.

Because objects in execution environment imported from Global environment are copies. Function body that changes object values will affect only the copy object but not the source object in the Global environment.

change_x_1stValue2a <- function(){
  x[1] <- "a"
}

x=c("b", "b", "b")
change_x_1stValue2a() # the copy of x in execution environment is changed
x # source x is unchanged

If you do want to change x value in the global environment, you need to do element value replacement, ie.

x = new_x # where new_x must be a return from change_x_1stValue2a function

Modify change_x_1stValue2a function

change_x_1stValue2a <- function(){
  x[1] <- "a"
  return(x)
}

x=c("b", "b", "b")
x=change_x_1stValue2a()
x

5.2.4 Input arguments

Sometimes you don’t want certain objects in function body to be copied and imported from Global environment, but to create a new one inside the execution environment. You can put those object name in function input arguments.

robot2 <- function(x){
  x+y
  browser()
}

Now do the function call

robot2(x=20)
  • Caller hands in a value for robot2 to bind with x in the execution environment. Or you can imagine caller tells robot2 what x is, then robot2 creates object x directly inside the execution environment – not copy/import from the global environment x.

  • Only y is copied/imported from the global environment.


Put object name as an input argument when creating the function can isolate the value of the same-name object in the global environment.


Earlier we have

dataSet4 = data.frame(...)
dataSet3 <- dataSet4
check_class()

If we put dataSet3 as check_class’s input argument, that is

check_class <- function(dataSet3){
  # class check
  class_check <- vector("integer", length(dataSet3))
  for(.x in seq_along(dataSet3)){
    class_check[[.x]] <- class(dataSet3[[.x]])
  }
  names(class_check) <- names(dataSet3)
  class_check # the last executed line
}

Then the function called to work on dataSet4 can be more straight forward:

dataSet4 = data.frame(...)
check_class(dataSet3 = dataSet4)
  • The input specification of the function call will not affect dataSet3 in the global environment. It simply tells check_class robot how to create a dataSet3 object in the execution environment directly.

  • dataSet3 will not be copied/imported from the Global environment.

In practice, input argument are object we want to create directly inside the execution environment under caller’s specification. Usually we will name the input argument in a more generous way instead of dataSet3, like:

check_class <- function(data_set){
  # class check
  class_check <- vector("integer", length(data_set))
  for(.x in seq_along(data_set)){
    class_check[[.x]] <- class(data_set[[.x]])
  }
  names(class_check) <- names(data_set)
  class_check # the last executed line
}

In function definition,

  • Input arguments ensure that those argument objects value are assigned from the function call, rather than copied/imported from the global environment.

  • For those objects in function body, they must be either defined by the function call (i.e. input argument), borrowed from the global environment, or an intermediate step by product inside the body.

myfun <- function(x){
  x+y -> z
  z**2+6
}
  • x will be defined in function call. y will be imported from the global environment, z is a byproduct of intermediate steps.
myfun(3)
y=5
myfun(3)

5.2.5 Advanced Concept of Environments

There are three places regarding function that are important:

  1. where it is lives,

  2. where it is called for a task,

  3. where it solves the task.

function environment

Place 1 is about where on R’s searching path we can find the function.

myFun <- function(){
  x+y
}
  • We can find myFun in the global environment. So the function environment of myFun is the global environment.

caller environment

Place 2 is called caller environment. It is where the function call happens. The function’s caller environment will be attached to its function environment as part of search path for the function to have the job done

Think of a function as a plumber whose lives in Taichung (function environment), gets a job call from Taipei (caller environment). When he needs some tools to do his job, he will start from his familiar Taichung to find those tools. If not found, he will try to find in Taipei.

myFun(x=3)
  • The function call happens in the global environment as well.

  • myFun plumber lives and gets a job in global environment.

execution environment

Place 3 is called execution environment. It is where the task content (aka function body) is done.

When a function does the job as the function call was specified, it creates a new environment to do the work – like an office. The environment is almost empty (no object to use), but it is attached to the function environment.

In many cases, at least for this course, those functions that we create live and get a job call in the same place, i.e. the Global environment. And the execution environment office is a place attached to the global environment. When the office needs some object for the job, but the object is not in the office, the function will go to global environment to look for it.

5.2.6 Extract function

Functions that import objects from Global environment can create future programming problem.

myfun <- function() xx + 5

# forget myfun needs to import xx 
myfun()

# there is a xx totally for different purpose
xx=c("a", "b")
myfun()

A good principle is unless you know what you are doing, otherwise never import objects from the global environment in function execution.

  • Always put objects the body needs (except the by product) in function arguments.
check_class <- function(dataSet3) {
  # class check
  class_check <- vector("integer", length(dataSet3))
  for(.x in seq_along(dataSet3)){
    class_check[[.x]] <- class(dataSet3[[.x]])
  }
  names(class_check) <- names(dataSet3)
  class_check # the last executed line
}


check_na <- function(dataSet3){
  # NA check
  na_check <- vector("integer", length(dataSet3))
  for(.x in seq_along(dataSet3)){
    na_check[[.x]] <- sum(is.na(dataSet3[[.x]]))
  }
  names(na_check) <- names(dataSet3)
  na_check # the last executed line
}

RStudio, select the programming block, then Code > Extract Function to create a function whose execution never import any object value from the global environment.

  • Remember to check the input arguments it picks up. A lot of time it will pick up package name, for loop iterates, etc. as well, which should not be input arguments
list_string = list(c("a b", "cd "), c("  a", " b"))
for(.x in seq_along(list_string)){
  stringr::str_remove_all(list_string[[.x]], '\\s') -> 
    list_string[[.x]]
}
list_string
  • it will pick up arguments (list_string, .x, stringr, str_remove_all)

5.2.7 Function usage and default

5.2.7.1 Argument name in function call

myfun2 <- function(x, y, z){ x+y+z}
myfun2(x=1, y=2, z=3)
myfun2(1, 2, 3) # ignore argument names if the order is correct
myfun2(z=3, x=1, y=2) # keep argument names can avoid wrong order problem

5.2.7.2 default

string = c("a, b", "a, c, b")
{
  stringr::str_split(string, ", ") -> string_splited
  table(unlist(string_splited))
}
tabulate_multipleSelections <- function(string){
  stringr::str_split(string, ", ") -> string_splited
  table(unlist(string_splited))
}
c("a, b", "a, c, b") |> tabulate_multipleSelections()
hobby = c(
  'sport, reading, movie',
  'sport',
  'movie, sport, reading',
  'movie, Reading',
  'sport')
hobby |> tabulate_multipleSelections()

Argument with default

Suppose sometimes you bumped into “;” separation data. But most of the time, it is “,”.

hobby2 = c(
  'sport; reading; movie',
  'sport',
  'movie; sport; reading',
  'movie; Reading',
  'sport')
tabulate_multipleSelections2 <- function(string, sep=","){
  pattern = paste(sep, '')
  stringr::str_split(string, pattern) -> string_splited
  table(unlist(string_splited))
}
hobby |> tabulate_multipleSelections2()
hobby2 |> tabulate_multipleSelections2(sep=";") # or
# tabulate_multipleSelections2(hobby2, sep=";")

Exercise 5.4 Import the wdi data from 4.8 Exercise-5 and obtain iso2c_nonCountry from Exercise 4.19

  1. The following code remove any non country entry in the data.
data_set <- wdi$data 
iso2c_nonCountry <- c('ZH','ZI','1A','S3','B8','V2','Z4','4E','T4','XC','Z7','7E','T7','EU','F1','XE','XD','XF','ZT','XH','XI','XG','V3','ZJ','XJ','T2','XL','XO','XM','XN','ZQ','XQ','T3','XP','XU','XY','OE','S4','S2','V4','V1','S1','8S','T5','ZG','ZF','T6','XT','1W')
pick_countries <- !(data_set$iso2c %in% iso2c_nonCountry)
data_set[pick_countries, ]

Use it to create a function remove_nonCountries so that any data frame, say df_example with iso2c in it can do the function call as the following to remove those non-country entries.

df_example <- remove_nonCountries(data_set=df_example)
  1. The following code remove non countries from the data set and narrow down further to year 2020 data. Then summarise the indicator’s mean, median, and range
wdi$data |> remove_nonCountries() -> data_set

data_set |> subset(year==2020) -> data_set2020 # it is the same as
code = "SG.GEN.PARL.ZS"
{
  data_set2020[[code]] |> range(na.rm=T) -> output_range
  data_set2020[[code]] |> mean(na.rm=T) -> output_mean
  data_set2020[[code]] |> median(na.rm=T) -> output_median
  list(
    mean=output_mean,
    median=output_median,
    range=list(output_range)
  ) |> list2DF()
}

Construct a function summarise_numerical which can be used to produce a summary data frame of mean, median, and range for any given data set (as input argument data_set) and a numerical feature column name (as input argument feature). In other words, with the help of summarise_numerical function, the above code chunk can be replace with

wdi$data |> remove_nonCountries() -> data_set

data_set |> subset(year==2020) -> data_set2020 # it is the same as
code = "SG.GEN.PARL.ZS"
summarise_numerical(data_set=data_set2020, feature=code)
  1. Gender inequality is an important issue in social science. One possible indicator to compare this inequality across countries is:
  • Proportion of seats held by women in national parliaments (%) (code name is “SG.GEN.PARL.ZS”).

What is the year range in the data set? For each year compute the mean of this indicator across countries. Is the trend of mean increasing over time?

  1. Create a function get_meanTrendOverYears when do the following function call, it will return a vector of the mean of all countries’ given code feature value over the years, with years as element names. (That is if mean is 2, 3, 8 for year 2010, 2011, 2012, then the returned vector should be the named numeric vector c("2010"=2, "2011"=3, "2012"=8).)
get_meanTrendOverYears(data_set=data_set, code="SG.GEN.PARL.ZS")

5.3 Task by situation

Consider the following programming flow:

A school organized welcoming talks for students of either Economics or Sociology major in different rooms. For Economics majors, it is room 201. For Sociology majors, it is room 301. You want to print:

  • Please go to room 201. (for Economics majors)

  • Please go to room 301. (for Sociology majors)

An incoming student’s major is

oneComingStudentMajor <- sample(c("Economics", "Sociology"), 1)

What should you print?

# oneComingStudentMajor == "Economics"
print("Please go to room 201.")

# oneComingStudentMajor == "Sociology"
print("Please go to room 301.")

5.3.1 if

In computer programming, we can create a conditional execution for each possible action using:

if(oneComingStudentMajor == "Economics"){
  print("Please go to room 201.")
}

if(oneComingStudentMajor == "Sociology"){
  print("Please go to room 301.")
}

For each conditional execution,

if(...){....}
  • {...} will be executed only if the (...) part is A TRUE value.

The (...) part is called a control flag. If the flag is up (i.e. TRUE), then the {...} will be released to be executed. If the flag is down (i.e. FALSE), then the {...} will be ignored (not executed).

Properties of a flag:

  • A flag is to control whether a programming block should be released to the programming flow.
# T flag will always release the block for execution
if(T){print("TRUE block")}
# F flag will never release the block for execution
if(F){print("FALSE block")}
  • A flag should only have ONE TRUE or FALSE. If flag is a vector, only the first one is used.
# T flag will always release the block for execution
if(c(T,T)){print("TT")}
if(c(T,F)){print("TF")}
if(c(F,F)){print("FF")}
if(c(F,T)){print("FT")}
# Use any() / all() if appropriate
if(c("a", "b")=="a"){print("case 1")}
if(any(c("a", "b")=="a")){print("case 2")}
if(all(c("a", "b")=="a")){print("case 3")}
  • Flag should NOT HAVE NA.
# use `isTRUE` to ensure the block is relased only when a legitimate flag value of TRUE happens
if(isTRUE(T)){print("case 1")}
if(isTRUE("a")){print("case 2")}
if(isTRUE(c(T,T))){print("case 3")}
if(isTRUE(NA)){print("case 4")}

# there is also `isFALSE`.

Set length warning into an error:

Sys.setenv("_R_CHECK_LENGTH_1_CONDITION_" = "true")

Rock-paper-scissor

rstudioapi::showPrompt('', message="Input r for rock, p for paper, or s for scissor") -> userChoice

computerChoice <- sample(c("r", "p", "s"), 1)

print(glue::glue("You chose {userChoice}, computer chose {computerChoice}"))

# If user won
print("You won.")

# If computer won,
print("Computer won.")

# If tie,
print("There is a tie.")

We need to design three flags for each execution:

if(flag_userwin){print("You won.")}
if(flag_computerwin){print("Computer won.")}
if(flag_tie){print("There is a tie.")}

A possible way to define these flags is:

combination <- paste0(userChoice, computerChoice)

flag_userwin <- combination %in% c("rs", "sp", "pr")
flag_computerwin <- combination %in% c("sr", "ps", "pr")
flag_tie <- combination %in% c("rr", "ss", "pp")

A complete program is:

rstudioapi::showPrompt('', message="Input r for rock, p for paper, or s for scissor") -> userChoice

computerChoice <- sample(c("r", "p", "s"), 1)

print(glue::glue("You chose {userChoice}, computer chose {computerChoice}"))

combination <- paste0(userChoice, computerChoice)

### three control flows
# r > s > p > r
if(combination %in% c("rs", "sp", "pr")){print("You won.")}
if(combination %in% c("sr", "ps", "rp")){print("Computer won.")}
if(combination %in% c("rr", "ss", "pp")){print("There is a tie.")}

For the following discussion, we will repetitively use the part before the control flow. For the ease of learning, let us define a function:

get_rock_paper_scissor_combination <- function(){
  rstudioapi::showPrompt('', message="Input r for rock, p for paper, or s for scissor") -> userChoice
  
  computerChoice <- sample(c("r", "p", "s"), 1)
  
  print(glue::glue("You chose {userChoice}, computer chose {computerChoice}"))
  
  outcome <- c("user"=userChoice, "computer"=computerChoice)
  
  return(outcome)
}

5.3.2 else if

If only one of the flags can be TRUE, a more efficient control flow would be using else to connect all the if conditional blocks.

outcome <- get_rock_paper_scissor_combination()
combination <- paste(outcome, collapse="")

### one control flow
# r > s > p > r
if(combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
if(combination %in% c("rr", "ss", "pp")){print("There is a tie.")}
  • The else connected conditional executions is considered as one control flow.

  • One control flow release at most one block to execute: the first TRUE flag will trigger its attached programming block’s execution. After that, the remaining flags and the their attached blocks will be ignored.

5.3.3 else

If all the flags exhaust all possible situations, the last if(...) can be ignored.

outcome <- get_rock_paper_scissor_combination()
combination <- paste(outcome, collapse="")
# r > s > p > r
if(combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
{print("There is a tie.")}

Three types of if control flow:

Three different control flows

if(flag){...}: release one block at most.

if(flag1){block1}else if(flag2){block2}...else if(flagN){blockN}: release one block at most–the first TRUE block will be released.

if(flag1){block1}else if(flag2){block2}...else if(flagM){blockM}else{blockN}: release definitely one block – either the first TRUE block, or blockN.

Any situation that does not make preceding flags true will be considered a TRUE for the last block. Weird thing can happen if you do not design flags properly. Try input anything other than “r”, “s”, “p”.

5.3.4 && and ||

outcome <- get_rock_paper_scissor_combination()
combination <- paste(outcome, collapse="")

flag_valid <- outcome[["user"]] %in% c("r", "s", "p")
# r > s > p > r
if(flag_valid & combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(flag_valid & combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
if(flag_valid){print("There is a tie.")} else
{print("Please input only r, s or p.")}

Though the above control flow works, it is not efficient. For &, computer will check all conditions. But in this case, if flag_valid is already FALSE, there is no need to check combination. Therefore, in control flow, it is better to use && to join multiple flag conditions.

  • ... && ... && ... && ... returns FALSE when it encounters the first FALSE and stop checking the conditions after that.
outcome <- get_rock_paper_scissor_combination()
combination <- paste(outcome, collapse="")

flag_valid <- outcome[["user"]] %in% c("r", "s", "p")
# r > s > p > r
if(flag_valid && combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(flag_valid && combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
if(flag_valid){print("There is a tie.")} else
{print("Please input only r, s or p.")}

The same situation also happens is for |. You should always use || for or-joining multiple flag conditions.

  • ... || ... || ... || ... returns TRUE when it encounters the first TRUE and stop checking the conditions after that.

There is also an efficient OR operator, ||. It is efficient in the sense that when a TRUE in the front is returned, no further condition check is needed, the result will be TRUE.

TRUE || x # will return TRUE even x is unknown.
TRUE | x # will return ERROR
FALSE && x # will return FALSE even x is unknown.
FALSE & x # will return ERROR

When construct flags, you should only use && and || instead of & and |. The latter are for vector comparison.

Since control flow will only release the 1st TRUE block, any preceding blocks not released mean their flags are all FALSE.

if(flag1){block1} else
if(flag2){block2} else
{block3} 

is the same as

if(flag1){block1} else
if(!flag1 && flag2){block2} else
if(!flag1 && !flag2){block3}

Another control flow design:

outcome <- get_rock_paper_scissor_combination()
combination <- paste(outcome, collapse="")

flag_valid <- outcome[["user"]] %in% c("r", "s", "p")
# r > s > p > r
if(!flag_valid){
  print("Please input only r, s or p.")
} else
if(combination %in% c("rs", "sp", "pr")){print("You won.")} else
if(combination %in% c("sr", "ps", "rp")){print("Computer won.")} else
{print("There is a tie.")}

5.4 An example

Given the data set:

dataSet1 <- 
  data.frame(
    cat = c('a','a','b','a','c','a','b','a','b','a','b','a','a','b','a'),
    num=c(75,81,68,81,68,85,58,55,70,83,78,83,67,87,70))

In preliminary data summary stage, for categorical data (mainly means categorical data), we want to summarise its count number in each category, its highest count category (i.e. mode). For numerical data, we want to compute its mean and range, etc.

  • The summarisation procedures depends on what class the data vector is.

    • if data vector is categorical, tabulate and find mode.

    • if data vector is numeric, compute mean and range.

5.4.0.1 Design control flow

# summarise a data vector, say data_vector

# (1) construct flags
flag_categorical <- is.character(data_vector) || is.factor(data_vector) || is.logical(data_vector)
flag_numerical <- is.numeric(data_vector)

# (2) Assemble control flow
if(flag_categorical){
  summarise_categorical(data_vector)
} else if(flag_numerical){
  summarise_numerical(data_vector)
}
  • The paths here are expressed as a function call since we can always make a programming block portable by defining a function. Then the execution of the programming block will be a function call.

Path of summarise categorical:

{
  output_table = table(data_vector)
  output_mode = output_table[(output_table == max(output_table))]
  list(
    table=output_table,
    mode=output_mode
  )
}

Path of summarise numerical:

{
  output_mean = mean(data_vector, na.rm=T)
  output_range = range(data_vector, na.rm=T)
  list(
    mean=output_mean,
    range=output_range
  )
}

Complete control flow:

# summarise a data vector, say data_vector

# (1) construct flags
flag_categorical <- is.character(data_vector) || is.factor(data_vector) || is.logical(data_vector)
flag_numerical <- is.numeric(data_vector)

# (2) Assemble control flow
if(flag_categorical){
  output_table = table(data_vector)
  output_mode = output_table[(output_table == max(output_table))]
  list(
    table=output_table,
    mode=output_mode
  )
} else if(flag_numerical){
  output_mean = mean(data_vector, na.rm=T)
  output_range = range(data_vector, na.rm=T)
  list(
    mean=output_mean,
    range=output_range
  )
}

Or

Path of summarise categorical:

summarise_categorical <- function(data_vector){
  output_table = table(data_vector)
  output_mode = output_table[(output_table == max(output_table))]
  list(
    table=output_table,
    mode=output_mode
  )
}

summarise_numerical <- function(data_vector){
  output_mean = mean(data_vector, na.rm=T)
  output_range = range(data_vector, na.rm=T)
  list(
    mean=output_mean,
    range=output_range
  )
}

OR

# summarise a data vector, say data_vector
# (0) load summarise path function (can also put in a source .R script then source it first)
summarise_categorical <- function(data_vector){
  output_table = table(data_vector)
  output_mode = output_table[(output_table == max(output_table))]
  list(
    table=output_table,
    mode=output_mode
  )
}

summarise_numerical <- function(data_vector){
  output_mean = mean(data_vector, na.rm=T)
  output_range = range(data_vector, na.rm=T)
  list(
    mean=output_mean,
    range=output_range
  )
}

# (1) construct flags
flag_categorical <- is.character(data_vector) || is.factor(data_vector) || is.logical(data_vector)
flag_numerical <- is.numeric(data_vector)

# (2) Assemble control flow
if(flag_categorical){
  summarise_categorical(data_vector)
} else if(flag_numerical){
  summarise_numerical(data_vector)
}

We can pack the above program into a big block and define a function to carry it.

# summarise a data vector, say data_vector
# (0) load summarise path function (can also put in a source .R script then source it first)
summarise_categorical <- function(data_vector){
  output_table = table(data_vector)
  output_mode = output_table[(output_table == max(output_table))]
  list(
    table=output_table,
    mode=output_mode
  )
}

summarise_numerical <- function(data_vector){
  output_mean = mean(data_vector, na.rm=T)
  output_range = range(data_vector, na.rm=T)
  list(
    mean=output_mean,
    range=output_range
  )
}

summarise_data_vector <- function(data_vector) {
  # (1) construct flags
  flag_categorical <- is.character(data_vector) || is.factor(data_vector) || is.logical(data_vector)
  flag_numerical <- is.numeric(data_vector)
  
  # (2) Assemble control flow
  if(flag_categorical){
    summarise_categorical(data_vector)
  } else if(flag_numerical){
    summarise_numerical(data_vector)
  }
}
summarise_data_vector(dataSet1$cat)
summarise_data_vector(dataSet1$num)

An even better one is to add a function call summarise_dataFrame:

summarise_dataFrame <- function(dataSet1){
  .summary <- vector("list", length(dataSet1))
  for(.x in seq_along(dataSet1)){
    .summary[[.x]] <- summarise_data_vector(dataSet1[[.x]])
  }
  names(.summary) <- names(dataSet1) # element names as data set column names
  return(.summary)
}
summarise_dataFrame(dataSet1)