R Programming Language Concepts II

Conditional Statements, Loops, and Packages

Author

Pei-Hsun Hsieh

Conditional Statements

Conditional statements are used to instruct the computer to execute specific code based on whether a condition is met, using if and else.

if (Condition) {
  # Code is executed if Condition is TRUE
} else {
  # Code is executed if Condition is FALSE
}

For example, if a student’s grade percentage is greater than or equal to 60, we want to print “Pass”; otherwise, we print “Fail”. Suppose the student’s percentage is saved in an object called percentage1. Here’s how you can use conditional statements in R to check the result:

percentage1 <- 70

if (percentage1 >= 60) {
  print("Pass")
} else {
  print("Fail")
}
[1] "Pass"

The condition inside the parentheses should evaluate to a single Boolean value (TRUE or FALSE):

percentage1 >= 60
[1] TRUE

In this case, it returns TRUE, so R will execute the code inside the if block. If there is a second student whose grade is 55, since the condition in if is not met, R will execute the code inside the else block.

percentage2 <- 55
if (percentage2 >= 60) {
  print("Pass")
} else {
  print("Fail")
}
[1] "Fail"

Sometimes, we need to check multiple conditions. In R, you can use else if for additional conditions. Each new condition should follow the pattern else if (Condition). For example, to assign letter grades based on a student’s percentage, we can structure the code as follows:

if (percentage1 >= 90) {
  print("Letter grade: A")
} else if (percentage1 >= 80 && percentage1 < 90) {
  print("Letter grade: B")
} else if (percentage1 >= 70 && percentage1 < 80) {
  print("Letter grade: C")
} else if (percentage1 >= 60 && percentage1 < 70) {
  print("Letter grade: D")
} else {
  print("Letter grade: F")
}
[1] "Letter grade: C"

In this example, the code evaluates the percentage and prints the corresponding letter grade based on the student’s score.
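Because R checks the conditions of an if / else if chain in order and stops at the first one that is TRUE, the upper bounds in the example above are optional. Here is a minimal sketch of the same logic, wrapped in a helper function called letter_grade() (a hypothetical name, not part of the original notes):

```r
# letter_grade() is a hypothetical helper name used only for illustration.
letter_grade <- function(percentage) {
  if (percentage >= 90) {
    "A"
  } else if (percentage >= 80) {  # only reached when percentage < 90
    "B"
  } else if (percentage >= 70) {  # only reached when percentage < 80
    "C"
  } else if (percentage >= 60) {  # only reached when percentage < 70
    "D"
  } else {
    "F"
  }
}

letter_grade(70)    # "C", matching the example above
letter_grade(59.5)  # "F"
```

Keeping the upper bounds, as in the original example, is also fine; it simply makes each range explicit.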

Single-Line if-else

If you want to assign a value based on a condition, rather than executing multiple lines of code, you can use the ifelse() function. For example, to assign a “Pass” or “Fail” value to Pass_Fail based on percentage1, you can write:

Pass_Fail <- ifelse(percentage1 >= 60, "Pass", "Fail")
Pass_Fail
[1] "Pass"

This is particularly useful when working with vectors. You can apply the ifelse() function to create a new vector based on conditions for each element. For instance, if we have the percentages for three students, we can do the following:

percentages <- c(75, 58, 90)
Pass_Fails <- ifelse(percentages >= 60, "Pass", "Fail")
Pass_Fails
[1] "Pass" "Fail" "Pass"
Tip

When an object can have only two possible values (like “Pass” or “Fail”), it’s often more efficient to use Boolean values (TRUE or FALSE) instead of strings. This saves memory space and allows for easier manipulation in future operations. For example, we can create a vector Pass_or_not and assign TRUE for pass and FALSE for fail:

Pass_or_not <- ifelse(percentages >= 60, TRUE, FALSE)
Pass_or_not
[1]  TRUE FALSE  TRUE

Let’s compare the memory usage of the single string Pass_Fail with that of the three-element Boolean vector Pass_or_not:

object.size(Pass_Fail)
112 bytes
object.size(Pass_or_not)
64 bytes

Even though Pass_or_not holds three values and Pass_Fail only one, the Boolean vector uses less memory!
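Boolean vectors are also easier to work with later because R treats TRUE as 1 and FALSE as 0 in arithmetic. Here is a small sketch using the same three percentages; the object name passed is chosen here for illustration:

```r
percentages <- c(75, 58, 90)

# A comparison already returns a logical vector, so ifelse() is optional here.
passed <- percentages >= 60

sum(passed)          # number of students who passed: 2
mean(passed)         # share of students who passed (2 of 3)
percentages[passed]  # the passing percentages: 75 90
```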

Loops

Sometimes, we want to repeat the same task (or the same task with different inputs) multiple times. For example, suppose we want to calculate and display the values of \(3\), \(3^2\), \(3^3\), \(3^4\), …, \(3^{20}\). We could type each one by hand:

3
[1] 3
3^2
[1] 9
3^3
[1] 27
# This would be too time-consuming to do manually!

In programming, we can automate this using loops, allowing us to avoid copying and pasting code repeatedly.

For Loops

To use a for loop, you need a vector to loop over, a variable whose value changes in each iteration (in parentheses after for), and curly brackets {} that contain the code you want to repeat. For example:

for(i in 1:20){
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] 18
[1] 19
[1] 20

You can also loop over a string vector:

PPE_courses <- c("Strategic Reasoning", "Behavioral Economics and Psychology", 
                 "Computational Text Analysis for Social Sciences", 
                 "Cooperation: Addressing Contemporary Societal Challenges", 
                 "Racial and Ethnic Politics", "Corruption", "Obedience", 
                 "Modeling Choice Behavior", "Cooperative Altruism")
for(course in PPE_courses){
  print(course)
}
[1] "Strategic Reasoning"
[1] "Behavioral Economics and Psychology"
[1] "Computational Text Analysis for Social Sciences"
[1] "Cooperation: Addressing Contemporary Societal Challenges"
[1] "Racial and Ethnic Politics"
[1] "Corruption"
[1] "Obedience"
[1] "Modeling Choice Behavior"
[1] "Cooperative Altruism"

If we want to calculate \(3\) raised to powers from 1 to 20, we can set up the loop like this:

for(i in 1:20){
  print(3^i)
}
[1] 3
[1] 9
[1] 27
[1] 81
[1] 243
[1] 729
[1] 2187
[1] 6561
[1] 19683
[1] 59049
[1] 177147
[1] 531441
[1] 1594323
[1] 4782969
[1] 14348907
[1] 43046721
[1] 129140163
[1] 387420489
[1] 1162261467
[1] 3486784401

We can also calculate the sum of \(3\), \(3^2\), \(3^3\), …, \(3^{20}\) by using a for loop. We create an object sum_3i and add a new value to this object in each iteration:

sum_3i <- 0

for(i in 1:20){
  sum_3i <- sum_3i + 3^i
}
sum_3i
[1] 5230176600
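If you want to keep every power rather than only the running total, a common pattern is to create an empty vector before the loop and fill in one element per iteration. Here is a minimal sketch; the object name powers is chosen here for illustration:

```r
powers <- numeric(20)           # a vector of 20 zeros to fill in

for (i in seq_along(powers)) {  # seq_along(powers) is 1, 2, ..., 20
  powers[i] <- 3^i
}

powers[1:3]   # 3 9 27
sum(powers)   # same total as sum_3i above: 5230176600
```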
Tip

For loops in R can be slow, so vectorization is often a more efficient approach when it’s available. Let’s compare the time it takes to calculate the sum of \(3\) raised to powers from 1 to 20 using a for loop and vectorization.

Using a for loop:

start_time <- Sys.time()
sum_3i <- 0
for(i in 1:20){
  sum_3i <- sum_3i + 3^i
}
sum_3i
[1] 5230176600
end_time <- Sys.time()
end_time - start_time
Time difference of 0.004309177 secs

Using vectorization:

start_time <- Sys.time()
sum(3^(1:20))
[1] 5230176600
end_time <- Sys.time()
end_time - start_time
Time difference of 0.001245975 secs

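In this run, the for loop took longer. Timings this short vary from run to run, so treat the numbers as indicative only. While the difference may not matter for small tasks, it can be significant in more complex code. However, vectorization cannot always be applied, for example in web scraping tasks, where each iteration typically has to request a different page.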
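Because a single measurement of such a small task is noisy, the comparison is easier to see on a larger input. The sketch below uses base R's system.time() on one million numbers; the size n and the sum-of-squares task are arbitrary choices made here for illustration. No timings are shown because they depend on the machine.

```r
n <- 1e6                 # arbitrary size chosen for illustration
x <- as.numeric(1:n)     # doubles, to avoid integer overflow in the sum

# For loop: add one squared value per iteration
loop_time <- system.time({
  loop_total <- 0
  for (value in x) {
    loop_total <- loop_total + value^2
  }
})

# Vectorized: square every element at once, then sum
vec_time <- system.time({
  vec_total <- sum(x^2)
})

all.equal(loop_total, vec_total)  # same total, up to floating-point rounding
loop_time["elapsed"]
vec_time["elapsed"]
```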

Break and Next Statements

break and next are useful tools when working with loops in R.

  • break immediately stops the execution of the loop.
  • next skips the remaining code in the current iteration and moves on to the next iteration.

break

If we want to stop a loop when i becomes larger than 10:

for(i in 1:20){
  print(i)
  if(i > 10){
    break
  }
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11

Notice that the position of break matters. In programming, code is executed line by line in sequence. The following example behaves differently, as the print() statement is placed after the break condition:

for(i in 1:20){
  if(i > 10){
    break
  }
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

next

We can use next to skip certain iterations. For example, to skip printing odd numbers between 1 and 20:

for(i in 1:20){
  if(i %% 2 != 0){
    next
  }
  print(i)
}
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14
[1] 16
[1] 18
[1] 20

Again, the order of the code is important. In the following example, next is placed after print(), so it won’t skip the print() function:

for(i in 1:20){
  print(i)
  if(i %% 2 != 0){
    next
  }
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] 18
[1] 19
[1] 20

In this case, the odd numbers are still printed: print() has already run by the time R reaches next, and next only skips the lines that come after it in the current iteration.
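break and next can also be combined in one loop. The following sketch, an illustrative example that is not from the original notes, adds up the even numbers and stops as soon as the running total exceeds 50:

```r
total <- 0

for (i in 1:20) {
  if (i %% 2 != 0) {
    next            # odd number: skip the rest of this iteration
  }
  total <- total + i
  if (total > 50) {
    break           # stop the whole loop once the total passes 50
  }
}

i      # 14, the value at which the loop stopped
total  # 56 = 2 + 4 + 6 + 8 + 10 + 12 + 14
```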

While Loops

A while loop is another type of loop in R. Unlike for loops that iterate over a vector, a while loop continues executing as long as the condition inside the parentheses remains TRUE.

For example, the following while loop increments i by 1 and prints its value until i reaches 10:

i <- 0
while(i < 10){
  i <- i + 1
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

In this case, the loop keeps running while i is less than 10. Once i reaches 10, the condition i < 10 becomes FALSE, and the loop stops.
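A while loop is most useful when you do not know in advance how many iterations are needed. As an illustration (the threshold of one million is an arbitrary choice made here), the sketch below finds the first power of 3 that exceeds one million:

```r
n <- 0
value <- 1

while (value <= 1e6) {
  n <- n + 1
  value <- value * 3   # value is now 3^n
}

n      # 13
value  # 1594323, the first power of 3 above one million
```

Make sure that something inside the loop eventually makes the condition FALSE; otherwise the loop never stops.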

Packages

A package in R is a collection of functions, often created and published by other programmers, that extends R’s capabilities. These packages save time by allowing us to use pre-built functions instead of writing everything from scratch. For example, the dplyr package provides powerful tools for data manipulation.

To install a package in R, use install.packages("package_name") (note that the package name must be in quotation marks):

install.packages("dplyr")

You only need to install a package once on your computer. However, each time you restart R, you need to load the package using library(package_name) (without quotation marks):

library(dplyr)
Warning: package 'dplyr' was built under R version 4.3.1

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Note

When you load a package using library(), the package is attached to your session, so all of its exported functions can be called by name.

There is also a package called pacman, which simplifies package management. Its function p_load can install packages if they aren’t installed yet and load them into your session. However, pacman itself needs to be installed manually (it cannot install itself). To install pacman, run:

install.packages("pacman")

Once installed, you can use p_load() to install and load packages in one step.

Additionally, you can use a function from a package without attaching the whole package with library() by writing the package name before the function, separated by ::. For example, to call the p_load() function from pacman without attaching pacman:

pacman::p_load(dplyr, readr)

In this case, since dplyr is already installed, p_load() will simply load it. If readr is not installed, p_load() will install and load it.
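If you prefer not to depend on pacman, a similar check can be sketched in base R with requireNamespace(), which returns TRUE when a package can be loaded and FALSE otherwise. The helper name is_installed() and the object names below are hypothetical, and the install step is left commented out so that nothing is installed by accident:

```r
# is_installed() is a hypothetical helper name used only for illustration.
is_installed <- function(pkg) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
  requireNamespace(pkg, quietly = TRUE)
}

wanted <- c("dplyr", "readr")
missing_pkgs <- wanted[!vapply(wanted, is_installed, logical(1))]
missing_pkgs   # character(0) when both packages are already installed

# Install only what is missing (uncomment to run):
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```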

Loading and Writing Data

In R, you can save all the objects from your current session using save.image(). This saves everything into an .RData file:

save.image(file = "PPE4000_Sep11.RData")

To load the .RData file back into your session, use the load() function:

load("PPE4000_Sep11.RData")

If you want to save a single object instead of the entire session, you can use saveRDS(). It stores one R object of any type (a vector, a data frame, a list, and so on) in its own file:

saveRDS(PPE_courses, "PPE_courses.RDS")

Let’s remove PPE_courses to demonstrate how to read it back.

rm(PPE_courses)

To load the object, use readRDS():

PPE_courses <- readRDS("PPE_courses.RDS")
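One practical difference from load() is that readRDS() returns the object, so you decide which name to assign it to. The sketch below checks a full round trip; it writes to a temporary file (via tempfile()) so that it does not touch your working directory, and the object names are chosen here for illustration:

```r
courses <- c("Strategic Reasoning", "Corruption", "Obedience")

rds_path <- tempfile(fileext = ".RDS")  # temporary file in R's session folder
saveRDS(courses, rds_path)

courses_copy <- readRDS(rds_path)       # assign the result to any name you like
identical(courses, courses_copy)        # TRUE
```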

Reading Data from Other Formats

R can handle various data formats, with different packages and functions available for each. One of the most common formats is .csv (Comma Separated Values). For reading and writing .csv files, the readr package provides efficient tools.

Visit The Climate Change Twitter Dataset, download The Climate Change Twitter Dataset.csv (2.09 GB), and unzip it.

To load the dataset, we’ll use the read_csv() function from the readr package. If the data file is located in R’s current working directory, you only need to specify the file name. If it’s in another directory, provide the full path. Remember that in R strings, a backslash (\) is an escape character, so a Windows path copied with backslashes will not work as is: replace each backslash with a forward slash (/) or a double backslash (\\).

You can check your current working directory with:

getwd()
[1] "C:/Users/phsieh/OneDrive/04 Teaching/Computational Text Analysis/2024 Fall UPenn/01 R Language Programming Language Basic"

And change it using setwd(new_directory) if needed.
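To make the point about backslashes concrete, here is a short sketch. The path is hypothetical and only illustrates the two spellings; file.path() is a base R function that joins folder and file names with forward slashes.

```r
# Hypothetical Windows path, written two equivalent ways
path_forward <- "C:/Users/me/Documents/tweets.csv"
path_escaped <- "C:\\Users\\me\\Documents\\tweets.csv"

cat(path_escaped, "\n")   # prints C:\Users\me\Documents\tweets.csv

# file.path() joins folder and file names with forward slashes
file.path("data", "tweets.csv")   # "data/tweets.csv"
```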

Now, let’s load the .csv file. Since the dataset is very large, we’ll load just the first 100 rows for practice by setting n_max = 100:

CC_Tweets <- read_csv("The Climate Change Twitter Dataset.csv", n_max = 100)
Rows: 100 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): topic, stance, gender, aggressiveness
dbl  (5): id, lng, lat, sentiment, temperature_avg
dttm (1): created_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The read_csv() function reads the .csv file into a data frame.

You can check the column names (variables) in this data frame using the colnames() function:

colnames(CC_Tweets)
 [1] "created_at"      "id"              "lng"             "lat"            
 [5] "topic"           "sentiment"       "stance"          "gender"         
 [9] "temperature_avg" "aggressiveness" 

Exercise

  1. Please use a for loop and conditional statements to print the sentiment of each tweet in CC_Tweets whose gender is female.
  2. Please add a new column to CC_Tweets called pos_sentiment, whose values are TRUE if the sentiment is positive and FALSE if the sentiment is negative.

We will talk more about how to use dplyr in the next class, as well as string manipulation in R!

Data Wrangling

dplyr is a powerful R package designed for data manipulation and transformation (see the dplyr documentation). dplyr code is usually written with the pipe operator (%>%), which comes from the magrittr package and is made available when you load dplyr; it lets you pass the output of one function directly into the next. From R version 4.1.0 onward, you can also use the native pipe operator (|>), but in this guide, we’ll use %>%. Below, we’ll explore a few essential functions for data wrangling using dplyr.

filter() – Picks cases based on their values

The filter() function allows you to subset rows based on a condition. For example, to filter tweets related to “Weather Extremes”:

CC_Tweets %>% filter(topic == "Weather Extremes")
# A tibble: 38 × 10
   created_at               id     lng   lat topic       sentiment stance gender
   <dttm>                <dbl>   <dbl> <dbl> <chr>           <dbl> <chr>  <chr> 
 1 2006-06-06 16:06:42    6132   NA     NA   Weather Ex…   -0.0972 neutr… female
 2 2006-07-23 21:52:30   13275  -73.9   40.7 Weather Ex…    0.576  neutr… undef…
 3 2006-08-29 01:52:30   23160   NA     NA   Weather Ex…    0.500  neutr… male  
 4 2006-11-07 02:46:52   57868   NA     NA   Weather Ex…    0.0328 neutr… male  
 5 2006-12-17 19:43:09 1278023  -79.8   36.1 Weather Ex…   -0.565  denier male  
 6 2006-12-21 01:39:01 1455543 -122.    38.0 Weather Ex…    0.651  neutr… male  
 7 2006-12-31 10:47:25 1893063   -1.90  52.5 Weather Ex…    0.671  neutr… male  
 8 2007-01-06 17:36:51 2266613  -73.9   40.7 Weather Ex…   -0.568  neutr… male  
 9 2007-01-06 18:08:03 2268453   NA     NA   Weather Ex…   -0.202  neutr… male  
10 2007-01-07 22:44:24 2332873   NA     NA   Weather Ex…   -0.395  neutr… male  
# ℹ 28 more rows
# ℹ 2 more variables: temperature_avg <dbl>, aggressiveness <chr>

select() – Selects specific columns

The select() function helps you choose specific columns from a dataset. Here, we’ll select the created_at and topic columns:

CC_Tweets %>% select(created_at, topic)
# A tibble: 100 × 2
   created_at          topic                                  
   <dttm>              <chr>                                  
 1 2006-06-06 16:06:42 Weather Extremes                       
 2 2006-07-23 21:52:30 Weather Extremes                       
 3 2006-08-29 01:52:30 Weather Extremes                       
 4 2006-11-07 02:46:52 Weather Extremes                       
 5 2006-11-27 14:27:43 Importance of Human Intervantion       
 6 2006-11-29 23:21:04 Seriousness of Gas Emissions           
 7 2006-12-11 22:08:14 Ideological Positions on Global Warming
 8 2006-12-14 01:39:10 Ideological Positions on Global Warming
 9 2006-12-17 19:43:09 Weather Extremes                       
10 2006-12-21 01:39:01 Weather Extremes                       
# ℹ 90 more rows

mutate() – Adds new variables or modifies existing ones

The mutate() function allows you to create new variables or modify existing ones. For example, let’s create a new variable sentiment_category based on the sentiment scores. dplyr also offers a single-line function for multiple conditions, case_when(). Here, we use case_when() to categorize sentiment values. Tweets with a sentiment score less than or equal to -0.1 are labeled as “Negative,” those between -0.1 and 0.1 as “Neutral,” and those greater than 0.1 as “Positive.”

CC_Tweets <- CC_Tweets %>% mutate(sentiment_category = case_when(
  sentiment <= -0.1 ~ "Negative",
  sentiment > -0.1 & sentiment <= 0.1 ~ "Neutral",
  sentiment > 0.1 ~ "Positive"
))
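The three functions can also be chained in a single pipeline. The sketch below assumes dplyr is installed and uses a small made-up data frame (toy_tweets, with hypothetical values) rather than the real dataset, so that the result is easy to check by eye:

```r
library(dplyr)

# Made-up rows for illustration; not taken from the real dataset
toy_tweets <- data.frame(
  topic     = c("Weather Extremes", "Politics", "Weather Extremes", "Weather Extremes"),
  sentiment = c(0.50, -0.40, 0.05, -0.30)
)

toy_result <- toy_tweets %>%
  filter(topic == "Weather Extremes") %>%   # keeps 3 of the 4 rows
  mutate(sentiment_category = case_when(
    sentiment <= -0.1 ~ "Negative",
    sentiment > -0.1 & sentiment <= 0.1 ~ "Neutral",
    sentiment > 0.1 ~ "Positive"
  )) %>%
  select(sentiment, sentiment_category)     # drops the topic column

toy_result$sentiment_category   # "Positive" "Neutral" "Negative"
```

Each step receives the data frame produced by the step before it, which is why the data frame is named only once, at the start of the pipeline.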

Exercise

  1. Please use a function from the dplyr package to show the tweets in CC_Tweets whose gender is female.
  2. Please use a function from dplyr to add a new column to CC_Tweets called pos_sentiment, whose values are TRUE if the sentiment is positive and FALSE if the sentiment is negative.