Conditional Statements, Loops, and Packages
Conditional statements are used to instruct the computer to execute specific code based on whether a condition is met, using if and else.
if (Condition) {
  Code is executed if Condition is TRUE
} else {
  Code is executed if Condition is FALSE
}
For example, if a student’s grade percentage is greater than or equal to 60, we want to print “Pass”; otherwise, we print “Fail”. Suppose the student’s percentage is saved in an object called percentage1. Here’s how you can use conditional statements in R to check the result:
percentage1 <- 70

if (percentage1 >= 60) {
  print("Pass")
} else {
  print("Fail")
}
[1] "Pass"
The condition inside the parentheses should be a Boolean value (TRUE or FALSE):
percentage1 >= 60
[1] TRUE
In this case, it returns TRUE, so R will execute the code inside the if block. If there is a second student whose grade is 55, since the condition in if is not met, R will execute the code inside the else block.
percentage2 <- 55

if (percentage2 >= 60) {
  print("Pass")
} else {
  print("Fail")
}
[1] "Fail"
Sometimes, we need to check multiple conditions. In R, you can use else if for additional conditions. Each new condition should follow the pattern else if (Condition). For example, to assign letter grades based on a student’s percentage, we can structure the code as follows:
if (percentage1 >= 90) {
  print("Letter grade: A")
} else if (percentage1 >= 80 & percentage1 < 90) {
  print("Letter grade: B")
} else if (percentage1 >= 70 & percentage1 < 80) {
  print("Letter grade: C")
} else if (percentage1 >= 60 & percentage1 < 70) {
  print("Letter grade: D")
} else {
  print("Letter grade: F")
}
[1] "Letter grade: C"
In this example, the code evaluates the percentage and prints the corresponding letter grade based on the student’s score.
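Because the branches of an if / else if chain are checked in order, each later branch is only reached when the earlier conditions were FALSE, so the upper bounds can be dropped. The sketch below illustrates this with a hypothetical helper, letter_grade(), which is not part of the original material and returns the grade instead of printing it:

```r
# Hypothetical helper (an assumption for illustration): returns the letter
# grade as a string so it can be stored or reused.
letter_grade <- function(percentage) {
  if (percentage >= 90) {
    "A"
  } else if (percentage >= 80) {
    "B"  # only reached when percentage < 90
  } else if (percentage >= 70) {
    "C"  # only reached when percentage < 80
  } else if (percentage >= 60) {
    "D"  # only reached when percentage < 70
  } else {
    "F"
  }
}

letter_grade(70)  # same case as percentage1 above
letter_grade(55)
```

Returning a value rather than printing it makes the result easier to reuse later, for example when building a new column.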
If you want to assign a value based on a condition, rather than executing multiple lines of code, you can use the ifelse() function. For example, to assign “Pass” or “Fail” to Pass_Fail based on percentage1, you can write:
Pass_Fail <- ifelse(percentage1 >= 60, "Pass", "Fail")
Pass_Fail
[1] "Pass"
This is particularly useful when working with vectors. You can apply the ifelse() function to create a new vector based on conditions for each element. For instance, if we have the percentages for three students, we can do the following:
percentages <- c(75, 58, 90)
Pass_Fails <- ifelse(percentages >= 60, "Pass", "Fail")
Pass_Fails
[1] "Pass" "Fail" "Pass"
When an object can have only two possible values (like “Pass” or “Fail”), it’s often more efficient to use Boolean values (TRUE or FALSE) instead of strings. This saves memory space and allows for easier manipulation in future operations. For example, we can create a vector Pass_or_not and assign TRUE for pass and FALSE for fail:
Pass_or_not <- ifelse(percentages >= 60, TRUE, FALSE)
Pass_or_not
[1] TRUE FALSE TRUE
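As a side note, a comparison such as percentages >= 60 already produces a Boolean vector, so ifelse() is optional in this particular case. A minimal sketch (the names pass_direct and pass_ifelse are illustrative only):

```r
percentages <- c(75, 58, 90)

# The comparison itself is vectorized and returns TRUE/FALSE per element.
pass_direct <- percentages >= 60
pass_ifelse <- ifelse(percentages >= 60, TRUE, FALSE)

identical(pass_direct, pass_ifelse)  # both approaches give the same vector

# Booleans are easy to work with: TRUE counts as 1 and FALSE as 0.
sum(pass_direct)   # number of students who passed
mean(pass_direct)  # proportion of students who passed
```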
Let’s compare the memory usage of the string vector Pass_Fail (which holds a single value) and the Boolean vector Pass_or_not (which holds three values):
object.size(Pass_Fail)
112 bytes
object.size(Pass_or_not)
64 bytes
As we can see, the Boolean vector uses significantly less RAM!
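To make the comparison on equal footing, the sketch below builds a string vector and a Boolean vector of the same length. The length of 1000 and the object names are arbitrary choices for illustration, and the exact byte counts depend on your R version and platform:

```r
n <- 1000  # arbitrary length chosen for illustration

# Same information stored two ways: as strings and as Booleans.
pass_strings <- rep(c("Pass", "Fail"), length.out = n)
pass_logical <- pass_strings == "Pass"

object.size(pass_strings)
object.size(pass_logical)
```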
Sometimes, we want to repeat the same task (or the same task with different inputs) multiple times. For example, suppose we want to calculate and display the values of \(3\), \(3^2\), \(3^3\), \(3^4\), …, \(3^{20}\):
3
[1] 3
3^2
[1] 9
3^3
[1] 27
# This would be too time-consuming to do manually!
In programming, we can automate this using loops, allowing us to avoid copying and pasting code repeatedly.
To use a for loop, you need a vector to loop over, a variable whose value will change each iteration (in parentheses after for), and curly brackets {} that contain the code you want to repeat. For example,
for(i in 1:20){
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] 18
[1] 19
[1] 20
You can also loop over a string vector:
PPE_courses <- c("Strategic Reasoning", "Behavioral Economics and Psychology",
                 "Computational Text Analysis for Social Sciences",
                 "Cooperation: Addressing Contemporary Societal Challenges",
                 "Racial and Ethnic Politics", "Corruption", "Obedience",
                 "Modeling Choice Behavior", "Cooperative Altruism")
for (course in PPE_courses) {
  print(course)
}
[1] "Strategic Reasoning"
[1] "Behavioral Economics and Psychology"
[1] "Computational Text Analysis for Social Sciences"
[1] "Cooperation: Addressing Contemporary Societal Challenges"
[1] "Racial and Ethnic Politics"
[1] "Corruption"
[1] "Obedience"
[1] "Modeling Choice Behavior"
[1] "Cooperative Altruism"
If we want to calculate \(3\) raised to powers from 1 to 20, we can set up the loop like this:
for(i in 1:20){
print(3^i)
}
[1] 3
[1] 9
[1] 27
[1] 81
[1] 243
[1] 729
[1] 2187
[1] 6561
[1] 19683
[1] 59049
[1] 177147
[1] 531441
[1] 1594323
[1] 4782969
[1] 14348907
[1] 43046721
[1] 129140163
[1] 387420489
[1] 1162261467
[1] 3486784401
We can also calculate the sum of \(3\), \(3^2\), \(3^3\), …, \(3^{20}\) by using a for loop. We create an object sum_3i and add a new value to this object in each iteration:
sum_3i <- 0

for (i in 1:20) {
  sum_3i <- sum_3i + 3^i
}
sum_3i
[1] 5230176600
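If you also want to keep each individual power rather than only the running total, one common pattern is to create an empty vector first and fill it inside the loop. This is a sketch; the object names n and powers are illustrative and not from the original:

```r
n <- 20
powers <- numeric(n)  # pre-allocated vector of n zeros

for (i in seq_len(n)) {
  powers[i] <- 3^i    # store the i-th power in position i
}

powers[1:4]           # first four stored values: 3, 9, 27, 81
sum_3i <- sum(powers) # same total as the running sum above
sum_3i
```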
For loops in R can be slow, so vectorization is often a more efficient approach when it’s available. Let’s compare the time it takes to calculate the sum of \(3\) raised to powers from 1 to 20 using a for loop and vectorization.
Using a for loop:
start_time <- Sys.time()
sum_3i <- 0
for (i in 1:20) {
  sum_3i <- sum_3i + 3^i
}
sum_3i
[1] 5230176600
end_time <- Sys.time()
end_time - start_time
Time difference of 0.004309177 secs
Using vectorization:
start_time <- Sys.time()
sum(3^(1:20))
[1] 5230176600
end_time <- Sys.time()
end_time - start_time
Time difference of 0.001245975 secs
You’ll notice that the for loop takes longer! A single run timed with Sys.time() is noisy, so the exact numbers will vary from run to run. While the difference may not matter for small tasks, it can be significant in more complex code. However, vectorization cannot always be applied, for example in web scraping tasks.
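Because a single timing of such a small task is dominated by noise, one option is to repeat each approach many times and time the whole batch with system.time(). The sketch below assumes two hypothetical helper functions, loop_sum() and vector_sum(), and an arbitrary repetition count; the timings themselves will differ between machines:

```r
# Hypothetical helpers wrapping the two approaches compared above.
loop_sum <- function(n) {
  total <- 0
  for (i in 1:n) {
    total <- total + 3^i
  }
  total
}

vector_sum <- function(n) {
  sum(3^(1:n))
}

reps <- 10000  # arbitrary number of repetitions

loop_time   <- system.time(for (r in seq_len(reps)) loop_sum(20))
vector_time <- system.time(for (r in seq_len(reps)) vector_sum(20))

loop_time["elapsed"]    # seconds for all repetitions of the loop version
vector_time["elapsed"]  # seconds for all repetitions of the vectorized version
```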
break and next are useful tools when working with loops in R.
break immediately stops the execution of the loop.
next skips the remaining code in the current iteration and moves on to the next iteration.
break
If we want to stop a loop when i becomes larger than 10:
for (i in 1:20) {
  print(i)
  if (i > 10) {
    break
  }
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
Notice that the position of break matters. In programming, code is executed line by line in sequence. The following example behaves differently, as the print() statement is placed after the break condition:
for (i in 1:20) {
  if (i > 10) {
    break
  }
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
next
We can use next to skip certain iterations. For example, to skip printing odd numbers between 1 and 20:
for (i in 1:20) {
  if (i %% 2 != 0) {
    next
  }
  print(i)
}
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14
[1] 16
[1] 18
[1] 20
Again, the order of the code is important. In the following example, next is placed after print(), so it won’t skip the print() function:
for (i in 1:20) {
  print(i)
  if (i %% 2 != 0) {
    next
  }
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] 18
[1] 19
[1] 20
In this case, the odd numbers are still printed because next only skips the lines that come after it, and print() has already run by the time next is reached.
A while loop is another type of loop in R. Unlike for loops that iterate over a vector, a while loop continues executing as long as the condition inside the parentheses remains TRUE.
For example, the following while loop increments i by 1 and prints its value until i reaches 10:
i <- 0
while (i < 10) {
  i <- i + 1
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
In this case, the loop keeps running while i is less than 10. Once i reaches 10, the condition i < 10 becomes FALSE, and the loop stops.
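A while loop is handy when you do not know in advance how many iterations are needed. As a sketch, the code below searches for the smallest power of 3 that exceeds a threshold; the threshold of one million and the object names are arbitrary choices for illustration:

```r
threshold <- 1e6  # arbitrary cut-off chosen for illustration
n <- 0

# Keep increasing the exponent while 3^n has not yet passed the threshold.
while (3^n <= threshold) {
  n <- n + 1
}

n    # 13, since 3^12 = 531441 and 3^13 = 1594323
3^n
```

Make sure the condition eventually becomes FALSE; otherwise the loop never stops.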
A package in R is a collection of functions, often created and published by other programmers, that extends R’s capabilities. These packages save time by allowing us to use pre-built functions instead of writing everything from scratch. For example, the dplyr package provides powerful tools for data manipulation.
To install a package in R, use install.packages("package_name") (note that the package name must be in quotation marks):
install.packages("dplyr")
You only need to install a package once on your computer. However, each time you restart R, you need to load the package using library(package_name) (without quotation marks):
library(dplyr)
Warning: package 'dplyr' was built under R version 4.3.1
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
When you load a package using library(), the package is attached to your session, so all of its exported functions can be called by name.
There is also a package called pacman, which simplifies package management. Its function p_load() can install packages if they aren’t installed yet and load them into your session. However, pacman itself needs to be installed manually (it cannot install itself). To install pacman, run:
install.packages("pacman")
Once installed, you can use p_load() to install and load packages in one step.
Additionally, you can use a function from a package without attaching the whole package by specifying the package name before the function with ::. R still loads the package’s namespace behind the scenes, but its functions are not added to your search path. For example, to use the p_load() function from pacman without attaching pacman:
pacman::p_load(dplyr, readr)
In this case, since dplyr is already installed, p_load() will simply load it. If readr is not installed, p_load() will install and load it.
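In base R, you can check whether a package is installed with requireNamespace() before deciding to install it. The sketch below uses the stats package, which ships with R, purely so that it runs on any machine; in practice you would substitute the package you need:

```r
pkg <- "stats"  # ships with R; used here only so the sketch runs anywhere

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
if (requireNamespace(pkg, quietly = TRUE)) {
  message(pkg, " is installed")
} else {
  message(pkg, " is missing; install.packages() would install it")
}

# Call one function through its namespace, without library():
stats::median(c(75, 58, 90))
```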
In R, you can save all the objects from your current session using save.image(). This saves everything into an .RData file:
save.image(file = "PPE4000_Sep11.RData")
To load the .RData file back into your session, use the load() function:
load("PPE4000_Sep11.RData")
If you want to save a single object instead of the entire session, you can use saveRDS(). This is particularly useful for saving specific or more complex data types:
saveRDS(PPE_courses, "PPE_courses.RDS")
Let’s remove PPE_courses to demonstrate how to read it back.
rm(PPE_courses)
To load the object, use readRDS():
PPE_courses <- readRDS("PPE_courses.RDS")
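Unlike load(), readRDS() returns the object, so you can assign it to any name you like. The following sketch checks the round trip using a temporary file; the object names and the shortened course list are illustrative only:

```r
courses <- c("Strategic Reasoning", "Corruption", "Obedience")

# Write to a temporary file so the example leaves nothing behind.
rds_path <- tempfile(fileext = ".RDS")
saveRDS(courses, rds_path)

# readRDS() returns the object; here it is stored under a different name.
restored <- readRDS(rds_path)
identical(courses, restored)  # the restored copy matches the original

file.remove(rds_path)         # clean up the temporary file
```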
R can handle various data formats, with different packages and functions available for each. One of the most common formats is .csv (Comma Separated Values). For reading and writing .csv files, the readr package provides efficient tools.
Visit The Climate Change Twitter Dataset, download The Climate Change Twitter Dataset.csv (2.09 GB), and unzip it.
To load the dataset, we’ll use the read_csv() function from the readr package. If the data file is located in R’s current working directory, you only need to specify the file name. If it’s in another directory, provide the full path. Remember that in R, backslashes (\) are escape characters, so replace them with forward slashes (/) or double backslashes (\\).
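One way to avoid typing slashes by hand is file.path(), which joins path components with forward slashes. The folder names below are hypothetical and only serve as an illustration:

```r
# Hypothetical folder; replace with the location of your own data.
data_dir <- file.path("C:", "Users", "me", "data")
csv_path <- file.path(data_dir, "The Climate Change Twitter Dataset.csv")
csv_path

# A double backslash in a string literal is a single backslash character:
nchar("C:\\Users")  # 8 characters: C, :, \, U, s, e, r, s
```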
You can check your current working directory with:
getwd()
[1] "C:/Users/phsieh/OneDrive/04 Teaching/Computational Text Analysis/2024 Fall UPenn/01 R Language Programming Language Basic"
You can change it using setwd(new_directory) if needed.
Now, let’s load the .csv file. Since the dataset is very large, we’ll load just the first 100 rows for practice by setting n_max = 100:
CC_Tweets <- read_csv("The Climate Change Twitter Dataset.csv", n_max = 100)
Rows: 100 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): topic, stance, gender, aggressiveness
dbl (5): id, lng, lat, sentiment, temperature_avg
dttm (1): created_at
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The read_csv() function reads the .csv file into a data frame (more precisely, a tibble).
You can check the column names (variables) in this data frame using the colnames() function:
colnames(CC_Tweets)
[1] "created_at" "id" "lng" "lat"
[5] "topic" "sentiment" "stance" "gender"
[9] "temperature_avg" "aggressiveness"
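If you do not have the dataset at hand, you can practice the same workflow on a small file you create yourself. The sketch below uses base R’s write.csv() and read.csv() instead of readr so that it runs without extra packages; nrows plays a role similar to n_max. The toy data, including the “Politics” topic, are made up for illustration:

```r
# Made-up toy data, not taken from the real dataset.
toy <- data.frame(
  id = 1:5,
  topic = c("Weather Extremes", "Politics", "Weather Extremes",
            "Politics", "Weather Extremes"),
  sentiment = c(-0.5, 0.05, 0.7, -0.2, 0.3)
)

csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read only the first 3 rows, similar in spirit to n_max in read_csv().
first_rows <- read.csv(csv_path, nrows = 3)
colnames(first_rows)
nrow(first_rows)

file.remove(csv_path)  # clean up the temporary file
```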
Show the tweets from female users in CC_Tweets.
Create a new column in CC_Tweets called pos_sentiment, whose values are TRUE if the sentiment is positive and FALSE if the sentiment is negative.
We will talk more about how to use dplyr in the next class, as well as string manipulation in R!
dplyr is a powerful R package designed for data manipulation and transformation (see the documentation here). One of its core features is the pipe operator (%>%), which lets you pass the output of one function directly into the next. From R version 4.1.0 onward, you can also use the native pipe operator (|>), but in this guide, we’ll use %>%. Below, we’ll explore a few essential functions for data wrangling using dplyr.
filter() – Picks cases based on their values
The filter() function allows you to subset rows based on a condition. For example, to filter tweets related to “Weather Extremes”:
CC_Tweets %>% filter(topic == "Weather Extremes")
# A tibble: 38 × 10
created_at id lng lat topic sentiment stance gender
<dttm> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <chr>
1 2006-06-06 16:06:42 6132 NA NA Weather Ex… -0.0972 neutr… female
2 2006-07-23 21:52:30 13275 -73.9 40.7 Weather Ex… 0.576 neutr… undef…
3 2006-08-29 01:52:30 23160 NA NA Weather Ex… 0.500 neutr… male
4 2006-11-07 02:46:52 57868 NA NA Weather Ex… 0.0328 neutr… male
5 2006-12-17 19:43:09 1278023 -79.8 36.1 Weather Ex… -0.565 denier male
6 2006-12-21 01:39:01 1455543 -122. 38.0 Weather Ex… 0.651 neutr… male
7 2006-12-31 10:47:25 1893063 -1.90 52.5 Weather Ex… 0.671 neutr… male
8 2007-01-06 17:36:51 2266613 -73.9 40.7 Weather Ex… -0.568 neutr… male
9 2007-01-06 18:08:03 2268453 NA NA Weather Ex… -0.202 neutr… male
10 2007-01-07 22:44:24 2332873 NA NA Weather Ex… -0.395 neutr… male
# ℹ 28 more rows
# ℹ 2 more variables: temperature_avg <dbl>, aggressiveness <chr>
select() – Selects specific columns
The select() function helps you choose specific columns from a dataset. Here, we’ll select the created_at and topic columns:
CC_Tweets %>% select(created_at, topic)
# A tibble: 100 × 2
created_at topic
<dttm> <chr>
1 2006-06-06 16:06:42 Weather Extremes
2 2006-07-23 21:52:30 Weather Extremes
3 2006-08-29 01:52:30 Weather Extremes
4 2006-11-07 02:46:52 Weather Extremes
5 2006-11-27 14:27:43 Importance of Human Intervantion
6 2006-11-29 23:21:04 Seriousness of Gas Emissions
7 2006-12-11 22:08:14 Ideological Positions on Global Warming
8 2006-12-14 01:39:10 Ideological Positions on Global Warming
9 2006-12-17 19:43:09 Weather Extremes
10 2006-12-21 01:39:01 Weather Extremes
# ℹ 90 more rows
mutate() – Adds new variables or modifies existing ones
The mutate() function allows you to create new variables or modify existing ones. For example, let’s create a new variable sentiment_category based on the sentiment scores. dplyr also offers a single-line function for multiple conditions, case_when(). Here, we use case_when() to categorize sentiment values. Tweets with a sentiment score less than or equal to -0.1 are labeled as “Negative,” those between -0.1 and 0.1 as “Neutral,” and those greater than 0.1 as “Positive.”
CC_Tweets <- CC_Tweets %>% mutate(sentiment_category = case_when(
  sentiment <= -0.1 ~ "Negative",
  sentiment > -0.1 & sentiment <= 0.1 ~ "Neutral",
  sentiment > 0.1 ~ "Positive"
))
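The three verbs can be chained with the pipe so that each step feeds the next. The sketch below runs on a small made-up data frame rather than the real dataset (the rows and the “Politics” topic are assumptions for illustration) and requires dplyr to be installed:

```r
library(dplyr)

# Made-up toy data standing in for CC_Tweets.
toy_tweets <- data.frame(
  topic = c("Weather Extremes", "Weather Extremes", "Politics"),
  gender = c("male", "female", "female"),
  sentiment = c(-0.5, 0.05, 0.7)
)

result <- toy_tweets %>%
  filter(topic == "Weather Extremes") %>%   # keep matching rows
  select(topic, sentiment) %>%              # keep two columns
  mutate(sentiment_category = case_when(    # add a derived column
    sentiment <= -0.1 ~ "Negative",
    sentiment > -0.1 & sentiment <= 0.1 ~ "Neutral",
    sentiment > 0.1 ~ "Positive"
  ))

result
```

Each verb returns a new data frame, so the original toy_tweets is left unchanged unless you assign the result back to it.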
Use the dplyr package to show tweets from female users in CC_Tweets.
Create a new column in CC_Tweets called pos_sentiment, whose values are TRUE if the sentiment is positive and FALSE if the sentiment is negative, using a function in dplyr.