R Programming Language Concepts I
Data Types and Functions
How to use this guide
This guide has two files, a .html file and a .qmd file. If you are not familiar with the R language yet, you can open the .html file to see what you can expect from running code chunks in the .qmd file.
This note is designed to complement the in-person sessions. Please follow along step by step during class.
How to Execute R Code
You can execute R code in RStudio in several ways:
- Type your code directly into the Console and press Enter.
- Write your code in an .R script, highlight it, and click the Run button at the top-right.
- Run code within code chunks in a Quarto or R Markdown file.
How to Learn a Programming Language
Learning a programming language is a valuable skill that can greatly enhance your ability to perform computational text analysis. Here are some tips to help you get started:
- Using programming languages involves utilizing fundamental tools to achieve tasks, rather than relying on one-stop, off-the-shelf solutions.
- While it has a steeper learning curve compared to off-the-shelf solutions, it offers much more flexibility.
- R and Python, the two programming languages covered in this course, provide flexibility and an ecosystem of modules (a.k.a. packages) with useful intermediate tools, so you don’t have to start from scratch.
- It is more important to understand concepts and logic than to memorize code. There is no need to memorize any code, as programmers use cheat sheets (and, nowadays, A.I.).
- Focus on understanding how and why the code works. This will help you adapt to new problems and find solutions more effectively.
- The most important goal of coding is to write code that works as expected. There is usually more than one way to code to achieve a certain result, and there is no single best way. However, keep in mind that better code should be (1) adaptable and (2) neat. It is difficult to write adaptable and neat code as a beginner, but you will get there with more experience.
Quarto
Quarto is a tool that combines code chunks with Markdown syntax, allowing you to execute code and generate output files in various formats, such as HTML, MS Word, and PDF. In this class, you can use Quarto to run code chunks within class notes. However, for submitting exercises and assignments, you are not required to use Quarto. For more information about Quarto, visit their website.
Objects, Data Types, and Functions
Objects
In programming, data is stored in memory as objects. An object in R has a name and an assigned value. In R, the assignment operator <- is used to assign values to objects. For example, you can create an object a and assign it the value of 9:
a <- 9
To view the value stored in an object, simply type the object’s name and run the code:
a
[1] 9
Object names can be a combination of letters and numbers, but certain names are reserved by the R language and cannot be used, such as if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, and NA. A name also cannot be a bare number (for example, 9). It’s also good practice to avoid naming objects after existing functions or package names.
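As a minimal sketch of these naming rules (the object names below are made up for illustration): valid names may also contain dots and underscores, and must not begin with a digit.

```r
# Hypothetical object names, chosen only for illustration
n_courses <- 3      # letters and an underscore: valid
course2 <- "PPE"    # a digit is fine as long as it is not the first character

# Reserved words cannot be used as names; uncommenting the next line
# would stop R with a syntax error
# if <- 5

n_courses
course2
```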
Functions
A function is a block of code that performs a specific task. Functions have a name, may take arguments (inputs), and may return a value (output).
In R, most functions are called by typing the function name followed by parentheses, with any inputs (arguments) placed inside the parentheses.
For example, the sqrt() function in R calculates the square root of a number. It takes one input (the number you want the square root of) and returns the square root. Here’s how you can calculate the square root of 9:
sqrt(9)
[1] 3
The result is 3.
You can also calculate the square root of the object a we created earlier:
sqrt(a)
[1] 3
Since we assigned the value 9 to a, the output will also be 3.
Additionally, you can store the result of sqrt(a) in a new object:
sqrt_of_a <- sqrt(a)
sqrt_of_a
[1] 3
Now, sqrt_of_a holds the value 3.
If you want to learn how to use a function in R, you can look up its documentation by typing ? followed by the function name. For example, to access the documentation for sqrt(), you would use:
?sqrt
This will display detailed information about the sqrt() function, including its arguments, usage, and examples.
A function is also an object in R, but instead of storing a value, it stores lines of code.
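To see that a function really is an object, here is a small sketch. The name square and its definition are hypothetical, and the function(x) { ... } syntax for writing your own functions is not covered in this section; it is used here only to show that a name can hold code instead of a value.

```r
# A built-in function is an object of class "function"
class(sqrt)

# 'square' is a hypothetical function, defined only for illustration
square <- function(x) {
  x * x
}

class(square)   # also "function"
square(3)       # calling it with parentheses runs the stored code

# Typing the name without parentheses prints the stored code
square
```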
Data Types of a Single Value
Data types refer to the types of values an object can hold. For example, 5 is a numeric value. Data types are important because certain operations and functions are only applicable to specific data types. The basic data types in R are:
Numeric
Values like 3 or 5.5. For example, the a object we created earlier:
a
[1] 9
You can use the class() function to check the data type of an object:
class(a)
[1] "numeric"
In this case, the type is numeric.
You can use R as a calculator with numeric values.
a + 3
[1] 12
a - 1
[1] 8
a * 10
[1] 90
a / 2
[1] 4.5
a %/% 2 # quotient
[1] 4
a %% 2 # remainder
[1] 1
Boolean
Values that can only be TRUE or FALSE. Referred to as “logical” in R.
b <- TRUE
b
[1] TRUE
You can verify the data type using class():
class(b)
[1] "logical"
This shows that b is of logical (Boolean) type. Note that R is case-sensitive, so TRUE and FALSE are distinct from true and false. The latter would result in an error since true/false are not recognized Boolean values in R:
c <- true
Error in eval(expr, envir, enclos): object 'true' not found
Error messages can provide useful information. In the error above, R reports that the object 'true' was not found because true is not a valid value in R; instead, it is interpreted as an object name, and no object with that name has been created.
If we assign a new value to an existing object, the previous value will be replaced:
b <- FALSE
b
[1] FALSE
Logical operators are used for working with Boolean values (TRUE or FALSE). Here are some important ones:
- Negation (Logical NOT): !
The negation operator (!) reverses a Boolean value. Logical NOT of TRUE is FALSE, and logical NOT of FALSE is TRUE.
!TRUE
[1] FALSE
!FALSE
[1] TRUE
- AND: &
The AND operator (&) returns TRUE only if both Boolean values are TRUE. If either or both are FALSE, it returns FALSE.
TRUE & TRUE # TRUE
[1] TRUE
TRUE & FALSE # FALSE
[1] FALSE
FALSE & TRUE # FALSE
[1] FALSE
FALSE & FALSE # FALSE
[1] FALSE
- OR: |
The OR operator (|) returns TRUE if at least one of the Boolean values is TRUE. It only returns FALSE if both are FALSE.
TRUE | TRUE # TRUE
[1] TRUE
TRUE | FALSE # TRUE
[1] TRUE
FALSE | TRUE # TRUE
[1] TRUE
FALSE | FALSE # FALSE
[1] FALSE
String
Referred to as “character” in R. Any value surrounded by quotation marks, such as "Philadelphia", "University of Pennsylvania", "5", or "true".
city <- "Philadelphia"
city
[1] "Philadelphia"
class(city)
[1] "character"
Even if you enclose numbers or Boolean values in quotation marks, they will be treated as strings:
d <- "5"
d
[1] "5"
class(d)
[1] "character"
Missing values
A missing value in R is NA. NA is not a string or a numeric value, but an indicator of missingness.
e <- NA
e
[1] NA
The function is.na() in R is used to identify missing values (NA) in an object. It returns TRUE if an element is NA and FALSE otherwise. Here’s how you can use it:
is.na(d)
[1] FALSE
is.na(e)
[1] TRUE
Comparison Operators
1. Equality Operator (==)
The equality operator == checks if two values or objects are equal in R. It returns TRUE if they are the same and FALSE if they are not.
5 == 5 # Since both values are 5, the result is TRUE.
[1] TRUE
Create variables a1, a2, and a3 with values 3, 3, and 5, respectively:
a1 <- 3
a2 <- 3
a3 <- 5
Check if a1 and a2 are equal:
a1 == a2 # Returns TRUE because both are 3.
[1] TRUE
Check if a2 and a3 are equal:
a2 == a3 # Returns FALSE because the values differ.
[1] FALSE
The operator also works with strings:
"apple" == "apple" # TRUE
[1] TRUE
"apple" == "banana" # FALSE
[1] FALSE
2. Inequality Operator (!=)
The inequality operator != returns TRUE if two values or objects are not equal, and FALSE otherwise.
a1 != a2 # Returns FALSE because both are 3.
[1] FALSE
a2 != a3 # Returns TRUE because the values differ.
[1] TRUE
3. Greater Than (>) and Less Than (<) Operators
These operators compare two values to determine which is larger. > returns TRUE if the left operand is larger than the right operand. < returns TRUE if the left operand is smaller.
Check if a2 is less than a3:
a2 < a3 # TRUE because 3 is less than 5.
[1] TRUE
Check if a2 is greater than a3:
a2 > a3 # FALSE because 3 is not greater than 5.
[1] FALSE
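Because every comparison returns a Boolean value, comparisons can be combined with the logical operators (!, &, |) introduced earlier. A minimal sketch, using a hypothetical object x:

```r
x <- 7   # hypothetical value, used only for illustration

x > 5 & x < 10     # TRUE: both comparisons are TRUE
x < 5 | x == 7     # TRUE: the second comparison is TRUE
!(x == 7)          # FALSE: negates a TRUE comparison
```

The parentheses in the last line make it explicit that the comparison is evaluated first and then negated.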
Exercise
- Check whether 2 * 10 is equal to 4 * 5 using R’s math and comparison operators:
- Check whether "PA" is equal to "pa" in R:
Why Are Data Types Important?
Some functions can only be applied to specific data types. For example, mathematical operations are meant for numeric values: 5 and 3 are numbers and can be added together:
5 + 3
[1] 8
Similarly, a number can have a square root, as we showed before:
sqrt(9)
[1] 3
However, you cannot perform addition or square root operations on strings:
"University" + "Pennsylvania"
Error in "University" + "Pennsylvania": non-numeric argument to binary operator
sqrt(city) # 'city' is assigned the string value "Philadelphia"
Error in sqrt(city): non-numeric argument to mathematical function
We will cover more operations for strings in later lectures.
Exercise
- Go to the Penn facts page. Create an object called N_students that stores the number of total students, and an object called N_faculty that stores the number of total faculty.
- Calculate the total student-faculty ratio using the two objects you just created. Assign this value to an object called SF_ratio.
- Use a function to check what data type the object SF_ratio is.
Data Type Conversion
In R, you can convert an object’s data type to another, but certain rules apply:
- Boolean to Numeric or String:
TRUE becomes 1 and FALSE becomes 0 when converted to numeric:
as.numeric(TRUE)
[1] 1
as.numeric(FALSE)
[1] 0
In R, Boolean values TRUE and FALSE can be directly used in mathematical operations as 1 and 0, respectively, without needing explicit conversion.
TRUE and FALSE become "TRUE" and "FALSE" when converted to strings:
as.character(TRUE)
[1] "TRUE"
as.character(FALSE)
[1] "FALSE"
- Numeric to Boolean or String:
- Any non-zero number converts to TRUE as a Boolean, while 0 converts to FALSE:
as.logical(5)
[1] TRUE
as.logical(-1)
[1] TRUE
as.logical(0)
[1] FALSE
- A number converted to a string is displayed inside quotation marks:
as.character(5)
[1] "5"
as.character(-1)
[1] "-1"
- String to Numeric or Boolean:
- Only number strings and Boolean strings such as "TRUE"/"FALSE" can be converted to their respective numeric or Boolean values. Otherwise, R returns NA (representing a missing value):
as.numeric("-1")
[1] -1
as.logical("FALSE")
[1] FALSE
as.numeric("PA") # NA is returned here
Warning: NAs introduced by coercion
[1] NA
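The point made above, that TRUE and FALSE can be used directly in arithmetic as 1 and 0, can be sketched as follows. The object names passed_quiz and points are hypothetical.

```r
TRUE + TRUE        # 2, because each TRUE counts as 1
TRUE + FALSE       # 1
FALSE * 10         # 0, because FALSE counts as 0

# Hypothetical objects, used only for illustration
passed_quiz <- TRUE
points <- 5 * passed_quiz   # no as.numeric() needed
points                      # 5
```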
Data Types of a Group of Values
We have introduced data types that store a single value, but there are also data types that can store multiple values in one object.
Vectors
A vector is a one-dimensional object containing multiple values of the same data type (similar to a vector in math). To create a vector, use the c() function, separating the values you want to assign with commas. For example, to create a vector called v1 that has the values 1, 3, and 5:
v1 <- c(1,3,5)
v1
[1] 1 3 5
Or for a vector of course names:
PPE_courses <- c("Strategic Reasoning", "Behavioral Economics and Psychology",
                 "Computational Text Analysis for Social Sciences",
                 "Cooperation: Addressing Contemporary Societal Challenges",
                 "Racial and Ethnic Politics", "Corruption", "Obedience",
                 "Modeling Choice Behavior", "Cooperative Altruism")
Values in a vector are ordered, meaning you can access a value by its index. For example, to access the second value in the PPE_courses vector:
PPE_courses[2]
[1] "Behavioral Economics and Psychology"
We can also access multiple values in a vector. For example, to access the second to the fourth values:
PPE_courses[2:4]
[1] "Behavioral Economics and Psychology"
[2] "Computational Text Analysis for Social Sciences"
[3] "Cooperation: Addressing Contemporary Societal Challenges"
Each vector can only store one type of value at a time. For example, if you try to store numbers, Boolean values, and strings in one vector, the numbers and Boolean values are coerced into strings:
v_numbers_and_strings <- c(19103, 19104, TRUE, "ZIP")
v_numbers_and_strings
[1] "19103" "19104" "TRUE" "ZIP"
Useful functions for creating vectors:
- To create a vector with consecutive integers, use the format start:end. For example, a vector from 1 to 10:
1:10
[1] 1 2 3 4 5 6 7 8 9 10
- To create a vector with a start, end, and interval, use seq():
seq(1,10,2)
[1] 1 3 5 7 9
- To create a vector with repeated values, use rep(). For example, for a vector with 10 values, all set to 2:
rep(2,10)
[1] 2 2 2 2 2 2 2 2 2 2
- To get the number of elements of a vector, use length():
length(PPE_courses)
[1] 9
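The calls above pass arguments by position. As a small sketch, the same vectors can be created by naming the arguments, which makes it clearer what each number means. The names from, to, by, x, and times are the argument names listed in ?seq and ?rep. The object odd_numbers is hypothetical.

```r
seq(from = 1, to = 10, by = 2)    # same result as seq(1, 10, 2)
rep(x = 2, times = 10)            # same result as rep(2, 10)

# Hypothetical object, used only for illustration
odd_numbers <- seq(from = 1, to = 10, by = 2)
length(odd_numbers)               # 5
```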
Vector Operations
Vectors can be used in arithmetic expressions, where operations are performed element by element. Let’s create a vector from 1 to 10 and call it v2:
v2 <- 1:10
v2
[1] 1 2 3 4 5 6 7 8 9 10
You can double all the elements in v2 by using v2 * 2:
v2 * 2
[1] 2 4 6 8 10 12 14 16 18 20
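Element-by-element arithmetic also applies when both operands are vectors. A minimal sketch, assuming two hypothetical vectors x and y of the same length:

```r
# Hypothetical vectors of equal length, used only for illustration
x <- c(1, 2, 3)
y <- c(10, 20, 30)

x + y    # 11 22 33: first with first, second with second, and so on
x * y    # 10 40 90
```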
If you want to check which numbers in v2 are even, you can use the modulo operator %% combined with ==:
v2 %% 2 == 0
[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
Let’s break down the code above. First, it calculates the remainder when each element of v2 is divided by 2:
step1 <- v2 %% 2
step1
[1] 1 0 1 0 1 0 1 0 1 0
This returns a numeric vector where odd numbers have a remainder of 1, and even numbers have a remainder of 0. Next, we use the == operator to check which elements are equal to 0:
step2 <- step1 == 0
step2
[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
This returns a Boolean vector, where the even numbers are TRUE and the odd numbers are FALSE.
You can use this Boolean vector to select the even numbers from v2:
v2[v2 %% 2 == 0]
[1] 2 4 6 8 10
Exercise
- Create a vector containing all multiples of 3 between 1 and 100. Use two different approaches to accomplish this.
# Method 1
# Method 2
- Count the multiples of 3 between 1 and 100 by finding the length of the vector you created in the first task.
Matrix
A matrix is a two-dimensional object containing multiple values of the same data type (similar to a matrix in math). To create a matrix, use the matrix() function. Let’s first check the documentation to see how to use it:
?matrix
The usage section shows:
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
This means the function has the following inputs:
1. data: A data vector.
2. nrow: The desired number of rows.
3. ncol: The desired number of columns.
4. byrow: Whether to fill the matrix by rows.
5. dimnames: Names for the dimensions.
If an argument is followed by = in the usage line, it has a default value and is optional. Arguments without = generally must be supplied.
For example, to create the matrix \(\begin{bmatrix} 1 & 2 & 3 & 4\\ 5 & 6 & 7 & 8 \end{bmatrix}\):
my_matrix <- matrix(c(1,2,3,4,5,6,7,8), nrow=2, byrow=TRUE)
my_matrix
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
Or equivalently:
my_matrix <- matrix(c(1,2,3,4,5,6,7,8), ncol=4, byrow=TRUE)
my_matrix
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
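Both calls above set byrow=TRUE. As a sketch of what the default values in the usage line mean in practice, the call below leaves byrow at its default (FALSE) and omits ncol entirely. The object name m_by_col is hypothetical.

```r
# byrow is left at its default (FALSE), so values fill column by column
m_by_col <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8), nrow = 2)
m_by_col
# first row: 1 3 5 7; second row: 2 4 6 8

# ncol was not given; R works it out from the length of the data and nrow
dim(m_by_col)   # 2 rows, 4 columns
```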
You can access the values by their index. For example, to access the value in the second row and the third column:
my_matrix[2,3]
[1] 7
If you want to access the whole row, just leave the column index blank:
my_matrix[2,]
[1] 5 6 7 8
Similarly, if you want to access the whole column, leave the row index blank:
my_matrix[,3]
[1] 3 7
You can also choose several indexes with a vector of indices. For example, to access the values in the second row and the first and the third column:
my_matrix[2,c(1,3)]
[1] 5 7
Exercise
Please access the values in the first row, from the second to the fourth column, of my_matrix. Use the : operator to create a vector of consecutive integers from 2 to 4.
List
A list in R can store multiple values of different types. For example:
unnamed_list <- list("PPE", 4000, TRUE, c("Monday", "Wednesday"))
unnamed_list
[[1]]
[1] "PPE"
[[2]]
[1] 4000
[[3]]
[1] TRUE
[[4]]
[1] "Monday" "Wednesday"
You can access an element by its index using double square brackets:
unnamed_list[[2]] # Returns 4000
[1] 4000
Unlike vectors, the data types in a list are not coerced:
class(unnamed_list[[2]]) # Numeric
[1] "numeric"
class(unnamed_list[[3]]) # Logical
[1] "logical"
A list is useful for grouping different data types together. The following is an example of a named list with two components, the number of PPE postdocs and their names.
PPE_postdocs <- list(N=4, names=c("Jaron Cordero", "Pei-Hsun Hsieh", "Jair Moreira", "Raj Patel"))
PPE_postdocs
$N
[1] 4
$names
[1] "Jaron Cordero" "Pei-Hsun Hsieh" "Jair Moreira" "Raj Patel"
Access elements in a named list:
PPE_postdocs$N
[1] 4
PPE_postdocs$names
[1] "Jaron Cordero" "Pei-Hsun Hsieh" "Jair Moreira" "Raj Patel"
PPE_postdocs$names[2]
[1] "Pei-Hsun Hsieh"
Exercise
Create a list with two components:
- The number of credits you are taking.
- The names of the courses you are enrolled in.
Data Frames
Data frames are one of the most commonly used data types in R, designed for statistical computing. They differ from matrices in that each column of a data frame can hold different data types.
To create a data frame, use the data.frame()
function. For example, let’s create a data frame for a course list:
my_course <- data.frame(
  Course = c("Computational Text Analysis", "Addressing Contemporary Societal Challenges"),
  Section = c("PPE 4000-302", "PPE 4600-301"),
  max_enroll = c(6, 18)
)
my_course
Course Section max_enroll
1 Computational Text Analysis PPE 4000-302 6
2 Addressing Contemporary Societal Challenges PPE 4600-301 18
Here, Course and Section are character vectors, and max_enroll is a numeric vector.
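To check that each column keeps its own data type, you can apply class() to one column at a time. The sketch below uses a hypothetical data frame called pets and the $ operator (introduced above for named lists) to pick out each column.

```r
# Hypothetical data frame, used only for illustration
pets <- data.frame(
  name = c("Rex", "Milo"),
  age = c(3, 5),
  vaccinated = c(TRUE, FALSE)
)

class(pets$name)         # "character" in R 4.0 or later
class(pets$age)          # "numeric"
class(pets$vaccinated)   # "logical"
```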
To access specific elements similar to a matrix, use indexing:
my_course[2, 1] # Second row, first column
[1] "Addressing Contemporary Societal Challenges"
my_course[2, ] # Entire second row
Course Section max_enroll
2 Addressing Contemporary Societal Challenges PPE 4600-301 18
my_course[, 1] # Entire first column
[1] "Computational Text Analysis"
[2] "Addressing Contemporary Societal Challenges"
To access columns using the $ operator:
my_course$Course # Accesses all course names
[1] "Computational Text Analysis"
[2] "Addressing Contemporary Societal Challenges"
my_course$Course[2] # Accesses the second course name
[1] "Addressing Contemporary Societal Challenges"
There are two primary methods to determine the number of rows and columns in a data frame in R:
- Using the dim() function: This function returns a vector where the first element is the number of rows and the second element is the number of columns in the data frame.
dim(my_course)
[1] 2 3
- Using the nrow() and ncol() functions: These functions return the number of rows and the number of columns, respectively.
nrow(my_course) # Returns the number of rows
[1] 2
ncol(my_course) # Returns the number of columns
[1] 3
All columns in a data frame must have the same length. If you attempt to create a data frame with uneven column lengths, R will display an error:
my_course2 <- data.frame(
  Course = c("Computational Text Analysis", "Addressing Contemporary Societal Challenges", "Independent Study"),
  Section = c("PPE 4000-302", "PPE 4600-301"),
  max_enroll = c(6, 18) # Error due to missing third enrollment
)
Error in data.frame(Course = c("Computational Text Analysis", "Addressing Contemporary Societal Challenges", : arguments imply differing number of rows: 3, 2
To resolve this, include NA for missing entries:
my_course2 <- data.frame(
  Course = c("Computational Text Analysis", "Addressing Contemporary Societal Challenges", "Independent Study"),
  Section = c("PPE 4000-302", "PPE 4600-301", "PPE-3999"),
  max_enroll = c(6, 18, NA)
)
my_course2
Course Section max_enroll
1 Computational Text Analysis PPE 4000-302 6
2 Addressing Contemporary Societal Challenges PPE 4600-301 18
3 Independent Study PPE-3999 NA
To add a new column:
my_course2$day_of_week <- c("MW", "M", NA)
To append rows using rbind():
my_course <- rbind(my_course, data.frame(Course = "Independent Study", Section = "PPE-3999", max_enroll = NA))
my_course
Course Section max_enroll
1 Computational Text Analysis PPE 4000-302 6
2 Addressing Contemporary Societal Challenges PPE 4600-301 18
3 Independent Study PPE-3999 NA
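The Boolean selection shown earlier for vectors also works on the rows of a data frame. A minimal sketch, using a hypothetical data frame called rooms with no missing values:

```r
# Hypothetical data frame, used only for illustration
rooms <- data.frame(
  room = c("A101", "B202", "C303"),
  seats = c(12, 40, 25)
)

# Step 1: a comparison on a column gives one Boolean value per row
rooms$seats > 20                 # FALSE TRUE TRUE

# Step 2: use it as the row index and leave the column index blank
rooms[rooms$seats > 20, ]        # keeps the B202 and C303 rows
```

Only rows where the comparison is TRUE are kept. The comma followed by a blank column index means that all columns are returned.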
Exercise
What type of object do you think would be the most appropriate data structure for tracking both the number of movies watched this year and the titles of each movie?
- Visit the Penn facts page and create a data frame for the undergraduate school populations.
- Use comparison operators to select and display rows from the data frame where the schools have more than 1000 students.