R Programming Language Concepts I

Data Types and Functions

Author

Pei-Hsun Hsieh

How to use this guide

This guide has two files, a .html file and a .qmd file. If you are not yet familiar with the R language, you can open the .html file to see what to expect from running the code chunks in the .qmd file.

This note is designed to complement the in-person sessions. Please follow along step by step during class.

How to Execute R Code

You can execute R code in RStudio in several ways:

  1. Type your code directly into the Console and press Enter.
  2. Write your code in an .R script, highlight it, and click the Run button at the top-right.
  3. Run code within code chunks in a Quarto or R Markdown file.

How to Learn a Programming Language

Learning a programming language is a valuable skill that can greatly enhance your ability to perform computational text analysis. Here are some tips to help you get started:

  1. Using programming languages involves utilizing fundamental tools to achieve tasks, rather than relying on one-stop, off-the-shelf solutions.
    • While it has a steeper learning curve compared to off-the-shelf solutions, it offers much more flexibility.
    • R and Python, the two programming languages covered in this course, provide flexibility and an ecosystem of modules (a.k.a. packages) with useful intermediate tools, so you don’t have to start from scratch.
  2. Understanding concepts and logic is more important than memorizing code. There is no need to memorize any code, as programmers use cheat sheets (and, nowadays, A.I.).
    • Focus on understanding how and why the code works. This will help you adapt to new problems and find solutions more effectively.
  3. The most important goal of coding is to write code that works as expected. There is usually more than one way to code to achieve a certain result, and there is no single best way. However, keep in mind that better code should be (1) adaptable and (2) neat. It is difficult to write adaptable and neat code as a beginner, but you will get there with more experience.

Quarto

Quarto is a tool that combines code chunks with Markdown syntax, allowing you to execute code and generate output files in various formats, such as HTML, MS Word, and PDF. In this class, you can use Quarto to run code chunks within class notes. However, for submitting exercises and assignments, you are not required to use Quarto. For more information about Quarto, visit their website.

Objects, Data Types, and Functions

Objects

In programming, data is stored in memory as objects. An object in R has a name and an assigned value. In R, the assignment operator <- is used to assign values to objects. For example, you can create an object a and assign it the value of 9:

a <- 9

To view the value stored in an object, simply type the object’s name and run the code:

a
[1] 9

Warning

Object names can combine letters, numbers, periods, and underscores, but they cannot start with a number or an underscore. Certain names are reserved by the R language and cannot be used: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, and NA. It’s also good practice to avoid naming objects after existing functions or packages.
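
As a rough sketch of these naming rules, the code below creates a few objects with valid names and then uses make.names(), a base R function that rewrites a string into a valid object name. A string that comes back unchanged was already valid. The object names and values here are hypothetical and chosen only for illustration.

```r
# Hypothetical object names, used only for illustration
n_students <- 25      # letters and an underscore
score.2024 <- 88.5    # letters, a period, and digits

# make.names() rewrites a string so that it becomes a valid name
make.names("n_students")  # already valid, returned unchanged
make.names("2nd_score")   # starts with a digit, so R prepends "X"
make.names("if")          # reserved word, so R appends "."
```

You rarely need make.names() in practice. It is used here only to show which names R accepts as they are.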

Functions

A function is a block of code that performs a specific task. Functions have a name, may take arguments (inputs), and may return a value (output).

In R, most functions are called by typing the function name followed by parentheses, with any inputs (arguments) placed inside the parentheses.

For example, the sqrt() function in R calculates the square root of a number. It takes one input (the number you want the square root of) and returns the square root. Here’s how you can calculate the square root of 9:

sqrt(9)
[1] 3

The result is 3.

You can also calculate the square root of the object a we created earlier:

sqrt(a)
[1] 3

Since we assigned the value 9 to a, the output will also be 3.

Additionally, you can store the result of sqrt(a) in a new object:

sqrt_of_a <- sqrt(a)
sqrt_of_a
[1] 3

Now, sqrt_of_a holds the value 3.
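
Functions can also take more than one argument. The sketch below uses round(), a base R function not otherwise covered in this note, to show that an argument can be given by position or by name. The input 3.14159 and the object name rounded_value are illustrative assumptions.

```r
# round() takes a number and, optionally, the number of decimal places
round(3.14159)              # digits defaults to 0, so the result is 3
round(3.14159, 2)           # second argument given by position: 3.14
round(3.14159, digits = 2)  # same argument given by name: 3.14

# The result can be stored like any other value
rounded_value <- round(3.14159, digits = 2)
rounded_value
```

Naming an argument makes the code easier to read, especially for functions with many inputs.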

If you want to learn how to use a function in R, you can look up its documentation by typing ? followed by the function name. For example, to access the documentation for sqrt(), you would use:

?sqrt

This will display detailed information about the sqrt() function, including its arguments, usage, and examples.

Note

A function is also an object in R, but instead of storing a value, it stores lines of code.
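
To see this for yourself, here is a minimal sketch that treats sqrt like any other object. The name my_root is hypothetical and used only for illustration.

```r
# A function has a class, just like the other objects in this note
class(sqrt)        # "function"

# A function can be assigned to a new name and called through that name
my_root <- sqrt
my_root(9)         # 3, the same result as sqrt(9)
```

This is also why naming your own objects after existing functions is best avoided: the two can be easily confused.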

Data Types of a Single Value

Data types refer to the types of values an object can hold. For example, 5 is a numeric value. Data types are important because certain operations and functions are only applicable to specific data types. The basic data types in R are:

Numeric

Values like 3 or 5.5. For example, the a object we created earlier:

a
[1] 9

You can use the class() function to check the data type of an object:

class(a)
[1] "numeric"

In this case, the type is numeric.

You can use R as a calculator with numeric values.

a+3
[1] 12
a-1
[1] 8
a*10
[1] 90
a/2
[1] 4.5
a%/%2 # quotient
[1] 4
a%%2 # remainder
[1] 1

Boolean

Values that can only be TRUE or FALSE. Referred to as “logical” in R.

b <- TRUE
b
[1] TRUE

You can verify the data type using class():

class(b)
[1] "logical"

This shows that b is of logical (Boolean) type. Note that R is case-sensitive, so TRUE and FALSE are distinct from true and false. The latter would result in an error since true/false are not recognized Boolean values in R:

c <- true
Error in eval(expr, envir, enclos): object 'true' not found

Tip

Error messages can provide useful information. In the error above, R reports “object 'true' not found” because true is not a valid value in R. Instead, R interprets it as an object name, and no object with that name has been created.

If we assign a new value to an existing object, the previous value will be replaced:

b <- FALSE
b
[1] FALSE

Logical operators are used for working with Boolean values (TRUE or FALSE). Here are some important ones:

  • Negation (Logical NOT): !
    The negation operator (!) reverses a Boolean value.
    Logical NOT of TRUE is FALSE, and logical NOT of FALSE is TRUE.
!TRUE
[1] FALSE
!FALSE
[1] TRUE
  • AND: &
    The AND operator (&) returns TRUE only if both Boolean values are TRUE. If either or both are FALSE, it returns FALSE.
TRUE & TRUE   # TRUE
[1] TRUE
TRUE & FALSE  # FALSE
[1] FALSE
FALSE & TRUE  # FALSE
[1] FALSE
FALSE & FALSE # FALSE
[1] FALSE
  • OR: |
    The OR operator (|) returns TRUE if at least one of the Boolean values is TRUE. It only returns FALSE if both are FALSE.
TRUE | TRUE   # TRUE
[1] TRUE
TRUE | FALSE  # TRUE
[1] TRUE
FALSE | TRUE  # TRUE
[1] TRUE
FALSE | FALSE # FALSE
[1] FALSE
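
When several logical operators appear in one expression, R applies ! first, then &, then |. The short sketch below illustrates how parentheses change the grouping. The object names are hypothetical.

```r
# & is applied before |, so this reads as TRUE | (FALSE & FALSE)
without_parens <- TRUE | FALSE & FALSE
without_parens   # TRUE

# Parentheses force the | to be evaluated first
with_parens <- (TRUE | FALSE) & FALSE
with_parens      # FALSE

# ! applies only to the value right after it unless parentheses are used
!TRUE & FALSE    # (!TRUE) & FALSE, which is FALSE
!(TRUE & FALSE)  # TRUE
```

When in doubt, adding parentheses makes the intended order explicit.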

String

Referred to as “character” in R. Any value surrounded by quotation marks, such as "Philadelphia", "University of Pennsylvania", "5", or "true".

city <- "Philadelphia"
city
[1] "Philadelphia"
class(city)
[1] "character"

Even if you enclose numbers or Boolean values in quotation marks, they will be treated as strings:

d <- "5"
d
[1] "5"
class(d)
[1] "character"

Missing values

A missing value in R is NA. NA is not a string or a numeric value, but an indicator of missingness.

e <- NA
e
[1] NA

The function is.na() in R is used to identify missing values (NA) in an object. It returns TRUE if an element is NA and FALSE otherwise. Here’s how you can use it:

is.na(d)
[1] FALSE
is.na(e)
[1] TRUE
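
One reason is.na() exists is that ordinary operations on NA usually return NA again. As a small sketch (the object names are illustrative assumptions):

```r
e <- NA

# Arithmetic with a missing value gives a missing result
na_plus_one <- e + 1
na_plus_one        # NA

# Comparing with == does not detect missingness: the answer is itself unknown
na_equals_na <- e == NA
na_equals_na       # NA, not TRUE

# is.na() is the reliable check
is.na(e)           # TRUE
```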

Comparison Operators

1. Equality Operator (==)

The equality operator == checks if two values or objects are equal in R. It returns TRUE if they are the same and FALSE if they are not.

5 == 5  # Since both values are 5, the result is TRUE.
[1] TRUE

Create variables a1, a2, and a3 with values 3, 3, and 5, respectively:

a1 <- 3
a2 <- 3
a3 <- 5

Check if a1 and a2 are equal:

a1 == a2  # Returns TRUE because both are 3.
[1] TRUE

Check if a2 and a3 are equal:

a2 == a3  # Returns FALSE because the values differ.
[1] FALSE

The operator also works with strings:

"apple" == "apple"  # TRUE
[1] TRUE
"apple" == "banana"  # FALSE
[1] FALSE

2. Inequality Operator (!=)

The inequality operator != returns TRUE if two values or objects are not equal, and FALSE otherwise.

a1 != a2  # Returns FALSE because both are 3.
[1] FALSE
a2 != a3  # Returns TRUE because the values differ.
[1] TRUE

3. Greater Than (>) and Less Than (<) Operators

These operators compare the order of two values. > returns TRUE if the left operand is larger than the right operand, and < returns TRUE if the left operand is smaller.

Check if a2 is less than a3:

a2 < a3  # TRUE because 3 is less than 5.
[1] TRUE

Check if a2 is greater than a3:

a2 > a3  # FALSE because 3 is not greater than 5.
[1] FALSE
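
R also provides >= (greater than or equal to) and <= (less than or equal to), and comparisons can be combined with the logical operators introduced earlier. A minimal sketch, reusing the values of a2 and a3 from above (the result names are hypothetical):

```r
a2 <- 3
a3 <- 5

a2 <= 3   # TRUE because 3 is equal to 3
a3 >= 6   # FALSE because 5 is smaller than 6

# Each comparison returns a Boolean value, so & and | can combine them
both_hold <- a2 < a3 & a3 < 10     # TRUE & TRUE
either_holds <- a2 > a3 | a3 == 5  # FALSE | TRUE
both_hold
either_holds
```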

Exercise

  1. Check whether 2 * 10 is equal to 4 * 5 using R’s arithmetic and comparison operators:
  2. Check whether “PA” is equal to “pa” in R:

Why Are Data Types Important?

Some functions can only be applied to specific data types. For example, mathematical operations can only be performed on numeric values. For instance, 5 and 3 are numbers and can be added together:

5 + 3
[1] 8

Similarly, a number can have a square root, as we showed before:

sqrt(9)
[1] 3

However, you cannot perform addition or square root operations on strings:

"University" + "Pennsylvania"
Error in "University" + "Pennsylvania": non-numeric argument to binary operator
sqrt(city)  # 'city' is assigned the string value "Philadelphia"
Error in sqrt(city): non-numeric argument to mathematical function

We will cover more operations for strings in later lectures.

Exercise

  1. Go to the Penn facts page. Create an object called N_students that stores the number of total students, and an object called N_faculty that stores the number of total faculty.
  2. Calculate the total student-faculty ratio using the two objects you just created. Assign this value to an object called SF_ratio.
  3. Use a function to check what data type the object SF_ratio is.

Data Type Conversion

In R, you can convert an object’s data type to another, but certain rules apply:

  • Boolean to Numeric or String:
    • TRUE becomes 1 and FALSE becomes 0 when converted to numeric:
as.numeric(TRUE)
[1] 1
as.numeric(FALSE)
[1] 0

Tip

In R, Boolean values TRUE and FALSE can be directly used in mathematical operations as 1 and 0, respectively, without needing explicit conversion.

    • TRUE and FALSE become “TRUE” and “FALSE” when converted to strings:
as.character(TRUE)
[1] "TRUE"
as.character(FALSE)
[1] "FALSE"
  • Numeric to Boolean or String:
    • Any non-zero number converts to TRUE as a Boolean, while 0 converts to FALSE:
as.logical(5)
[1] TRUE
as.logical(-1)
[1] TRUE
as.logical(0)
[1] FALSE
    • A number converts to the string you would get by wrapping it in quotation marks:
as.character(5)
[1] "5"
as.character(-1)
[1] "-1"
  • String to Numeric or Boolean:
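    • Only strings that look like numbers can be converted to numeric values, and only strings such as "TRUE" and "FALSE" (R also accepts "T", "F", "true", "false", "True", and "False") can be converted to Boolean values. Otherwise, R returns NA (representing a missing value):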
as.numeric("-1")
[1] -1
as.logical("FALSE")
[1] FALSE
as.numeric("PA")  # NA is returned here
Warning: NAs introduced by coercion
[1] NA
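
The conversion rules above can be put together in a short sketch. The object names are hypothetical, and suppressWarnings() is used only to hide the coercion warning so that the result can be checked quietly.

```r
# Boolean values act as 1 and 0 in arithmetic without explicit conversion
TRUE + TRUE              # 2
5 * FALSE                # 0

# A number stored as a string must be converted before doing math
as.numeric("3.5") + 1    # 4.5

# A failed conversion returns NA, which is.na() can detect
converted <- suppressWarnings(as.numeric("PA"))
is.na(converted)         # TRUE

# Strings that R does not recognize as Boolean values also become NA
as.logical("true")       # TRUE
as.logical("yes")        # NA
```

Checking for NA after a conversion is a simple way to find values that could not be converted.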

Data Types of a Group of Values

We have introduced data types that store a single value, but there are also data types that can store multiple values in one object.

Vectors

A vector is a one-dimensional object containing multiple values of the same data type (similar to a vector in math). To create a vector, use the c() function, separating the values you want to assign with commas. For example, to create a vector called v1 that has the values 1, 3, and 5:

v1 <- c(1,3,5)
v1
[1] 1 3 5

Or for a vector of course names:

PPE_courses <- c("Strategic Reasoning", "Behavioral Economics and Psychology", 
                 "Computational Text Analysis for Social Sciences", 
                 "Cooperation: Addressing Contemporary Societal Challenges", 
                 "Racial and Ethnic Politics", "Corruption", "Obedience", 
                 "Modeling Choice Behavior", "Cooperative Altruism")

Values in a vector are ordered, meaning you can access a value by its index, counting from 1. For example, to access the second value in the PPE_courses vector:

PPE_courses[2]
[1] "Behavioral Economics and Psychology"

We can also access multiple values in a vector. For example, to access the second to the fourth values:

PPE_courses[2:4]
[1] "Behavioral Economics and Psychology"                     
[2] "Computational Text Analysis for Social Sciences"         
[3] "Cooperation: Addressing Contemporary Societal Challenges"

Each vector can only store one type of value at a time. For example, if you try to store numbers, Boolean values, and strings in one vector, the numbers and Boolean values are coerced into strings:

v_numbers_and_strings <- c(19103, 19104, TRUE, "ZIP")
v_numbers_and_strings
[1] "19103" "19104" "TRUE"  "ZIP"  

Useful functions for creating vectors:

  • To create a vector with consecutive integers, use the format start:end. For example, a vector from 1 to 10:
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
  • To create a vector with a start, end, and interval, use seq():
seq(1,10,2)
[1] 1 3 5 7 9
  • To create a vector with repeated values, use rep(). For example, for a vector with 10 values, all set to 2:
rep(2,10)
 [1] 2 2 2 2 2 2 2 2 2 2
  • To get the number of elements of a vector, use length():
length(PPE_courses)
[1] 9
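
The functions above accept named arguments, which can make the intent clearer. The following is a small sketch with illustrative values. The arguments from, to, by, length.out, times, and each are part of base R, while the object names are hypothetical.

```r
# seq() with named arguments: start, end, and step size
quarter_steps <- seq(from = 0, to = 1, by = 0.25)
quarter_steps    # 0.00 0.25 0.50 0.75 1.00

# Instead of a step size, you can ask for a fixed number of values
five_points <- seq(from = 0, to = 100, length.out = 5)
five_points      # 0 25 50 75 100

# rep() can repeat a whole vector or repeat each element
rep(c("A", "B"), times = 3)   # "A" "B" "A" "B" "A" "B"
rep(c("A", "B"), each = 3)    # "A" "A" "A" "B" "B" "B"
```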

Vector Operations

Vectors can be used in arithmetic expressions, where operations are performed element by element. Let’s create a vector from 1 to 10 and call it v2:

v2 <- 1:10
v2
 [1]  1  2  3  4  5  6  7  8  9 10

You can double all the elements in v2 by using v2 * 2:

v2 * 2
 [1]  2  4  6  8 10 12 14 16 18 20

If you want to check which numbers in v2 are even, you can use the modulo operator %% combined with ==:

v2 %% 2 == 0
 [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

Let’s break down the code above. First, it calculates the remainder when each element of v2 is divided by 2:

step1 <- v2 %% 2
step1
 [1] 1 0 1 0 1 0 1 0 1 0

This returns a numeric vector where odd numbers have a remainder of 1, and even numbers have a remainder of 0. Next, we use the == operator to check which elements are equal to 0:

step2 <- step1 == 0
step2
 [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

This returns a Boolean vector, where the even numbers are TRUE and the odd numbers are FALSE.

You can use this Boolean vector to select the even numbers from v2:

v2[v2 %% 2 == 0]
[1]  2  4  6  8 10
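
A Boolean vector can be used in a few other ways. The sketch below uses which() to get the positions of the TRUE values and sum() to count them, since TRUE counts as 1. The threshold values and the name is_large are chosen only for illustration.

```r
v2 <- 1:10

# A Boolean vector marking the elements larger than 7
is_large <- v2 > 7

which(is_large)   # positions of the TRUE values: 8 9 10
sum(is_large)     # number of TRUE values: 3

# Conditions can be combined with & before selecting
v2[v2 > 3 & v2 < 7]   # 4 5 6
```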

Exercise

  1. Create a vector containing all multiples of 3 between 1 and 100. Use two different approaches to accomplish this.
# Method 1
# Method 2
  2. Count the multiples of 3 between 1 and 100 by finding the length of the vector you created in the first task.

Matrix

A matrix is a two-dimensional object containing multiple values of the same data type (similar to a matrix in math). To create a matrix, use the matrix() function. Let’s first check the documentation to see how to use it:

?matrix

The usage section shows:

matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)

This means the function has the following inputs:

  1. data: A data vector.
  2. nrow: The desired number of rows.
  3. ncol: The desired number of columns.
  4. byrow: Whether to fill the matrix by rows.
  5. dimnames: Names for the dimensions.

If an argument is followed by =, it is optional and has a default value. Arguments without = are required.

For example, to create the matrix \(\begin{bmatrix} 1 & 2 & 3 & 4\\ 5 & 6 & 7 & 8 \end{bmatrix}\):

my_matrix <- matrix(c(1,2,3,4,5,6,7,8), nrow=2, byrow=TRUE)
my_matrix
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8

Or equivalently:

my_matrix <- matrix(c(1,2,3,4,5,6,7,8), ncol=4, byrow=TRUE)
my_matrix
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
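
Because byrow defaults to FALSE, leaving it out fills the matrix column by column. A minimal sketch of the difference, using a smaller hypothetical matrix built from 1:6:

```r
# Default (byrow = FALSE): values fill the first column, then the second, ...
by_column <- matrix(1:6, nrow = 2)
by_column
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE: values fill the first row, then the second
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```

In both calls, R works out the number of columns from the length of the data and nrow.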

You can access the values by their index. For example, to access the value in the second row and the third column:

my_matrix[2,3]
[1] 7

If you want to access the whole row, just leave the column index blank:

my_matrix[2,]
[1] 5 6 7 8

Similarly, if you want to access the whole column, leave the row index blank:

my_matrix[,3]
[1] 3 7

You can also choose several indexes with a vector of indices. For example, to access the values in the second row and the first and the third column:

my_matrix[2,c(1,3)]
[1] 5 7
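
Much of what works on vectors also works on matrices. As a rough sketch, the code below checks the size of my_matrix, applies arithmetic to every element, and selects values with a Boolean condition. The names scaled and large_values are hypothetical.

```r
my_matrix <- matrix(c(1,2,3,4,5,6,7,8), nrow=2, byrow=TRUE)

dim(my_matrix)    # number of rows, then number of columns: 2 4

# Arithmetic is applied to each element
scaled <- my_matrix * 10
scaled[2, 3]      # 70

# A condition selects the matching values, returned as a vector
large_values <- my_matrix[my_matrix > 5]
large_values      # 6 7 8
```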

Exercise

Access the values in the first row, from the second to the fourth column, of my_matrix. Use the : operator to create a vector of consecutive integers from 2 to 4.

List

A list in R can store multiple values of different types. For example:

unnamed_list <- list("PPE", 4000, TRUE, c("Monday", "Wednesday"))
unnamed_list
[[1]]
[1] "PPE"

[[2]]
[1] 4000

[[3]]
[1] TRUE

[[4]]
[1] "Monday"    "Wednesday"

You can access an element by its index using double square brackets:

unnamed_list[[2]]  # Returns 4000
[1] 4000

Unlike vectors, the data types in a list are not coerced:

class(unnamed_list[[2]])  # Numeric
[1] "numeric"
class(unnamed_list[[3]])  # Logical
[1] "logical"

A list is useful for grouping different data types together. The following is an example of a named list with two components, the number of PPE postdocs and their names.

PPE_postdocs <- list(N=4, names=c("Jaron Cordero", "Pei-Hsun Hsieh", "Jair Moreira", "Raj Patel"))
PPE_postdocs
$N
[1] 4

$names
[1] "Jaron Cordero"  "Pei-Hsun Hsieh" "Jair Moreira"   "Raj Patel"     

Access elements in a named list:

PPE_postdocs$N
[1] 4
PPE_postdocs$names
[1] "Jaron Cordero"  "Pei-Hsun Hsieh" "Jair Moreira"   "Raj Patel"     
PPE_postdocs$names[2]
[1] "Pei-Hsun Hsieh"

Exercise

Create a list with two components:

  1. The number of credits you are taking.
  2. The names of the courses you are enrolled in.

Data Frames

Data frames are one of the most commonly used data types in R, designed for statistical computing. They differ from matrices in that each column of a data frame can hold different data types.

To create a data frame, use the data.frame() function. For example, let’s create a data frame for a course list:

my_course <- data.frame(
  Course = c("Computational Text Analysis", "Addressing Contemporary Societal Challenges"),
  Section = c("PPE 4000-302", "PPE 4600-301"),
  max_enroll = c(6, 18)
)
my_course
                                       Course      Section max_enroll
1                 Computational Text Analysis PPE 4000-302          6
2 Addressing Contemporary Societal Challenges PPE 4600-301         18

Here, Course and Section are character vectors, and max_enroll is a numeric vector.
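
You can check the column types yourself. As a small sketch, class() works on each column, and str() prints a compact summary of the whole data frame. The stringsAsFactors = FALSE argument is written out here as an assumption about your setup: it is the default since R 4.0.0, and older versions would otherwise store strings as factors.

```r
my_course <- data.frame(
  Course = c("Computational Text Analysis", "Addressing Contemporary Societal Challenges"),
  Section = c("PPE 4000-302", "PPE 4600-301"),
  max_enroll = c(6, 18),
  stringsAsFactors = FALSE  # already the default in R 4.0.0 and later
)

class(my_course)             # "data.frame"
class(my_course$Course)      # "character"
class(my_course$max_enroll)  # "numeric"

# str() lists every column with its type and first values
str(my_course)
```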

As with a matrix, you can access specific elements by index:

my_course[2, 1]  # Second row, first column
[1] "Addressing Contemporary Societal Challenges"
my_course[2, ]   # Entire second row
                                       Course      Section max_enroll
2 Addressing Contemporary Societal Challenges PPE 4600-301         18
my_course[, 1]   # Entire first column
[1] "Computational Text Analysis"                
[2] "Addressing Contemporary Societal Challenges"

You can also access columns by name with the $ operator:

my_course$Course       # Accesses all course names
[1] "Computational Text Analysis"                
[2] "Addressing Contemporary Societal Challenges"
my_course$Course[2]    # Accesses the second course name
[1] "Addressing Contemporary Societal Challenges"

There are two primary methods to determine the number of rows and columns in a data frame in R:

  1. Using the dim() function: This function returns a vector where the first element is the number of rows and the second element is the number of columns in the data frame.
dim(my_course)
[1] 2 3
  2. Using nrow() and ncol() functions: These functions return the number of rows and the number of columns respectively.
nrow(my_course)  # Returns the number of rows
[1] 2
ncol(my_course)  # Returns the number of columns
[1] 3

All columns in a data frame must have the same length. If you attempt to create a data frame with uneven column lengths, R will display an error:

my_course2 <- data.frame(
  Course = c("Computational Text Analysis", "Addressing Contemporary Societal Challenges", "Independent Study"),
  Section = c("PPE 4000-302", "PPE 4600-301"),
  max_enroll = c(6, 18)  # Error: Section and max_enroll have 2 values, but Course has 3
)
Error in data.frame(Course = c("Computational Text Analysis", "Addressing Contemporary Societal Challenges", : arguments imply differing number of rows: 3, 2

To resolve this, include NA for missing entries:

my_course2 <- data.frame(
  Course = c("Computational Text Analysis", "Addressing Contemporary Societal Challenges", "Independent Study"),
  Section = c("PPE 4000-302", "PPE 4600-301", "PPE-3999"),
  max_enroll = c(6, 18, NA)
)
my_course2
                                       Course      Section max_enroll
1                 Computational Text Analysis PPE 4000-302          6
2 Addressing Contemporary Societal Challenges PPE 4600-301         18
3                           Independent Study     PPE-3999         NA

To add a new column:

my_course2$day_of_week <- c("MW", "M", NA)
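
A new column can also be computed from an existing one. As a minimal sketch, the code below builds a cut-down version of my_course2 and adds a hypothetical column is_small, using an illustrative cutoff of 10 seats. Note that the missing enrollment stays missing in the new column.

```r
my_course2 <- data.frame(
  Course = c("Computational Text Analysis", "Addressing Contemporary Societal Challenges", "Independent Study"),
  max_enroll = c(6, 18, NA)
)

# The comparison is applied row by row, and NA stays NA
my_course2$is_small <- my_course2$max_enroll < 10
my_course2$is_small   # TRUE FALSE NA

ncol(my_course2)      # the data frame now has 3 columns
```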

To append rows using rbind():

my_course <- rbind(my_course, data.frame(Course = "Independent Study", Section = "PPE-3999", max_enroll = NA))
my_course
                                       Course      Section max_enroll
1                 Computational Text Analysis PPE 4000-302          6
2 Addressing Contemporary Societal Challenges PPE 4600-301         18
3                           Independent Study     PPE-3999         NA

Exercise

  1. What type of object do you think is the most appropriate data structure for tracking both the number of movies watched this year and the title of each movie?
  2. Visit the Penn facts page and create a data frame for the undergraduate school populations.
  3. Use comparison operators to select and display the rows of the data frame for schools with more than 1000 students.