Intro to R for Data Analysis

Apr 2024

In this workshop

  • What is R, and why should I use it?
  • Key terms
  • Worked example of the R workflow
    • Setting up RStudio project
    • Importing data
    • Data cleaning/data management
    • Data analysis

Motivation

What we have: Data

Suppose we have data we want to “work with”, and our data is arranged like the following:

  • First row contains variable names
  • Subsequent rows contain observations (1 row = 1 respondent or unit)
  • Each column represents a variable (1 column = 1 distinct variable)
  • Common file formats for this type of data: CSV, TXT

Screenshot of structured data with first row containing variable names and subsequent rows containing data values


What can we do with this data?

What we want: Statistics


                     SleepTime
Predictors           Estimates   CI               p
(Intercept)          9.73        9.24 – 10.23     <0.001
StudyTime            -0.30       -0.37 – -0.23    <0.001
Rank [So]            -0.74       -1.27 – -0.20    0.007
Rank [Jr]            -0.47       -1.10 – 0.17     0.149
Rank [Sr]            0.34        -0.69 – 1.36     0.519
Observations         359
R2 / R2 adjusted     0.263 / 0.255

What we want: Graphs

What we really want: Communication

What we really want is to be able to understand our data and communicate our findings to others (in whatever format that may take).

What we really want: Reproducibility

And we want to do these things in a way that’s consistent, transparent, and repeatable:

library(haven)
library(sjPlot)

#' Read data from SPSS and apply labels
df <- haven::read_spss("./data/Sample_Dataset_2019.sav")
df$Rank <- factor(df$Rank, 
                  levels=1:4, 
                  labels=c("Fr", "So", "Jr", "Sr"))

#' Fit linear regression model
mylm <- lm(SleepTime ~ StudyTime + Rank, data=df)

#' Generate nice table with parameter estimates and p-values
sjPlot::tab_model(mylm)

R and RStudio can help you achieve all of these things

About R

What it is

A programming language designed for statistical data analysis

An open source statistical software program

A community of data scientists and practitioners

Operating systems

Windows, Mac, Linux

Cost

Free!

How to use R

Option 1: Run R/RStudio Locally

To install R:

cran.r-project.org

To install RStudio:

rstudio.com

Option 2: Run R/RStudio in the Cloud

Create account:

posit.cloud

The R user interface

  • The console is the “command line” interface part of R – type code into it, press enter, get results back
  • The console is fine for “playing around”, but most of the time, you’ll put everything into an R script – a file containing executable R code

The RStudio User Interface

  • Using RStudio is completely optional, but it has many helpful features to make writing and managing R code easier

R syntax

Necessary vocabulary before we can get to the “good stuff”

Basic usage

  • Type commands into the R console, press “enter”, see results
  • R is a great basic calculator! Try doing some basic arithmetic like addition (+), subtraction (-), multiplication (*), division (/), powers/exponents (** or ^)
12+13
[1] 25
12*13
[1] 156
12/13
[1] 0.9230769
12**2
[1] 144

Objects

  • To really benefit from R, we can’t type and re-type every operation by hand: we need to be able to store and retrieve the results of calculations.
  • Objects are named “things” within R that we can access or manipulate.
  • Objects are created using the assignment operator: <-
object_name <- value_to_assign_to_object
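
A minimal sketch of assignment in practice. The object names total and half_total are hypothetical, chosen only for illustration:

```r
# Store the result of a calculation in an object named `total`
total <- 12 + 13

# Typing the object's name retrieves the stored value
total

# Objects can be reused in later calculations
half_total <- total / 2
print(half_total)
```

Once a value is stored, later code can refer to it by name instead of repeating the calculation.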

Functions

  • Similarly, functions are named “actions” or formulas we can perform in R.
  • Functions usually “look like” a word followed by parentheses, with “stuff” in between the parentheses:
mean(...)
  • When you start using R, you’ll mostly be working with pre-existing functions from base R or R packages.
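
As a small illustration of calling pre-existing functions, the sketch below uses the base R functions round() and sqrt(). The object names rounded and root are made up for this example:

```r
# Call round() with one positional argument and one named argument
rounded <- round(3.14159, digits = 2)
print(rounded)

# Functions can be nested: the result of sqrt() is passed to round()
root <- round(sqrt(2), digits = 3)
print(root)
```

Arguments can be matched by position (the first value) or by name (digits = 2). Naming them tends to make code easier to read.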

Types of Objects: Vectors

  • Vectors are a special type of object that represents an ordered collection of numbers (or other data values, like letters).
    • Created using the c() function:
new_vector <- c(3, 1, 1, 0)
print(new_vector)
[1] 3 1 1 0


new_vector_char <- c("a", "b", "c", "d")
print(new_vector_char)
[1] "a" "b" "c" "d"

Types of Objects: Dataframes

  • Dataframes are R’s representation of the “grid-like” data structure seen earlier
    • Each column within a dataframe is technically a vector (most of the time)
  • We can create small dataframes using the data.frame() function, but most of the time we’ll create them by importing an external data file
mydf <- data.frame(new_vector=c(3, 1, 1, 0),
                   new_vector_letters=c("a", "b", "c", "d"))
print(mydf)
  new_vector new_vector_letters
1          3                  a
2          1                  b
3          1                  c
4          0                  d

Putting it together: Doing arithmetic using objects

We can use functions and arithmetic operators on our named objects instead of having to re-write the original values every time we do a calculation:

new_object <- 5
print(new_object)
[1] 5


new_vector <- c(3, 1, 1, 0)
print(new_vector)
[1] 3 1 1 0


new_vector + new_object
[1] 8 6 6 5

Putting it together: Using functions on objects

Example: Compute the mean of new_vector:

# Reminder of the values of new_vector
print(new_vector)
[1] 3 1 1 0


# Calculate the mean of new_vector
mean(new_vector)
[1] 1.25

Looking up functions

  • You can open a function’s documentation page using the help() function:
help("mean")
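
Two other base R ways to look up a function, shown as a short sketch. sd() is used here only as a convenient example:

```r
# `?` is shorthand for help(): this opens the documentation page for mean()
?mean

# args() prints a function's argument names and default values
# without opening the full help page
args(sd)
```

args() is handy when you only need a quick reminder of what a function's arguments are called.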

R packages

R can do a lot, but there are new analytic methods coming out all the time. That’s where R packages – and the power of the R community – come in.

  • R packages are user-created modules of code intended to fulfill a narrow task or purpose
    • e.g. Reading Excel files into R
  • Packages are shared through repositories, which are like moderated archives for sharing of code

Using R packages

  • To use an R package, we must first install it from a host repository
    • This is done by running the function install.packages(...) in the console (only need to do this once)
  • Once a package has been installed, it must be loaded using the library(...) function at the start of your R session
install.packages("ggplot2")
library("ggplot2")
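
Once a package is installed, its functions can be called in two ways. The sketch below uses the stats package, which ships with R, so that it runs without an install step. The same pattern applies to packages installed from a repository:

```r
# stats ships with R, so no install.packages() call is needed here
library(stats)

# After library(), functions can be called by name...
median(c(3, 1, 1, 0))

# ...or with the explicit package::function form, which makes clear
# which package a function comes from
stats::median(c(3, 1, 1, 0))
```

The package::function form is the same style as haven::read_spss() elsewhere in these slides.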

Worked Example

Today’s sample data

Tutorial sample data:

Research questions we’ll consider today:

  • Do students who study more get less sleep?
  • Do underclassmen (freshmen and sophomores) get less sleep than upperclassmen (juniors and seniors)?

The R Data Analysis Workflow

  1. Read/import data into R
    • Determine if special R packages needed to read data
  2. Data management
    • Check data quality
    • Compute new variables
    • Filter rows
    • Select columns
  3. Data analysis
    • Descriptive statistics
    • Plots
    • Statistical tests and models

Working in R, part 1 | Importing text-based data

Step 1: Get data into R

  • Data frames are R’s object type for traditional “data sets”
  • Base R function to import text and CSV files as a data frame: read.table or read.csv
mydata <- read.csv(file="C:/Users/yourname/yourfolder/Sample_Dataset_2019.csv")

This code creates an object named mydata containing the data from this file as a data frame.
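
To try read.csv() without the workshop file, here is a self-contained sketch that first writes a tiny CSV to a temporary location. The column names and values are made up for illustration and are not taken from the sample dataset:

```r
# Write a small CSV to a temporary file so the example is self-contained
# (hypothetical columns and values, for illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,StudyTime,SleepTime",
             "1,2.5,8",
             "2,4,6.5",
             "3,1,9"), tmp)

# Import the file as a data frame; the first row supplies the variable names
demo_data <- read.csv(file = tmp)
print(demo_data)

# Confirm the import produced a data frame and check its dimensions
class(demo_data)
dim(demo_data)
```

With real data, replace tmp with the path to your own file.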

Step 1: Get data into R

What if the data isn’t in CSV or plaintext format?

  • Oftentimes, it is necessary to find an R package that can read other data formats. Some packages I would recommend are:
    • Excel: readxl
    • SAS, SPSS, and Stata 13+: haven

Example: Reading SPSS-format dataset into R using package haven (absolute file path)

library(haven)
mydata <- haven::read_spss(file="C:/Users/yourname/yourfolder/Sample_Dataset_2019.sav")

Step 2: Look at the data we imported into R

How do we know if our import was successful?

The following functions take a data frame as an argument, and return previews or summaries about that data frame:

  • View the dataframe as a spreadsheet using View(mydata)
  • Print variable types using function str(mydata) (str is short for structure)
  • Print minimum, maximum, median, mean, quartiles, and the number of missing values for each variable in the dataframe using summary(mydata)
  • Print first 6 rows of dataframe using head(mydata) (use the n argument to change the number of rows)
  • Print last 6 rows of dataframe using tail(mydata)
  • Print names of variables in dataframe using names(mydata)
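
The inspection functions above can be tried on a small toy data frame. The data below is hypothetical, and View() is left out because it needs an interactive session:

```r
# Hypothetical toy data frame with 8 rows and some missing values
demo_data <- data.frame(id = 1:8,
                        Rank = c(1, 2, 3, 4, 1, 2, 3, NA),
                        SleepTime = c(8, 6.5, 7, 9, 5.5, 7.5, NA, 6))

str(demo_data)       # variable types and a preview of values
summary(demo_data)   # min, quartiles, median, mean, max, and NA counts
names(demo_data)     # variable names

# head() and tail() show 6 rows by default; use n to change that
first_rows <- head(demo_data)
last_three <- tail(demo_data, n = 3)
print(first_rows)
print(last_three)
```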

Step 3: Access variables within our dataset

  • Some functions expect a dataframe; other functions expect a vector of observations – that is, a single, specific column from a dataframe
  • Each variable is simply a named vector inside our dataframe

When we need to access a column of a dataframe as a vector, we use the $ operator to extract it:

mydata$Mile

This syntax can be combined with the assignment operator to add new variables to a dataframe, or edit existing variables in a dataframe:

mydata$Mile_minutes <- mydata$Mile/60
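
A runnable sketch of the same idea on toy data. It assumes mile times are recorded in seconds, which is what dividing by 60 to get minutes implies. The values are made up:

```r
# Hypothetical toy data: mile run times assumed to be in seconds
demo_data <- data.frame(id = 1:3,
                        Mile = c(420, 510, 615))

# Extract one column as a vector and pass it to a function
mile_seconds <- demo_data$Mile
mean(mile_seconds)

# Add a new column computed from an existing one
demo_data$Mile_minutes <- demo_data$Mile / 60
print(demo_data)
```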

Step 4: Subsetting rows or columns of our dataset

Normally when we want to access specific rows of a dataframe, what we really want is to filter our data by some condition. We can do this using function subset():

freshmen_only <- subset(mydata, subset = Rank==1)

The first argument to subset is the name of the dataframe. The second argument is a logical condition for which rows to keep (here, Rank==1, i.e. keep only freshmen).

If we instead want to drop or keep specific columns of our dataset, we can also use the subset function, but use its select argument:

grade_data <- subset(mydata, select = English:Science)
fitness_data <- subset(mydata, select = c(Athlete, Smoking, Mile))
data_dropbday <- subset(mydata, select = -bday)
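
The subset and select arguments can also be used together in one call. The sketch below demonstrates this on a hypothetical toy data frame. The column names echo the examples above, but the values are invented:

```r
# Hypothetical toy data frame (values invented for illustration)
demo_data <- data.frame(id = 1:6,
                        Rank = c(1, 2, 1, 3, 4, 1),
                        English = c(80, 75, 90, 85, 70, 88),
                        Math = c(78, 82, 95, 60, 73, 91),
                        bday = c("2000-01-15", "1999-06-02", "2000-09-30",
                                 "1998-03-11", "1997-12-25", "2000-07-04"))

# Keep only rows where Rank equals 1
freshmen_only <- subset(demo_data, subset = Rank == 1)

# Keep a range of adjacent columns, or drop a single column
grades <- subset(demo_data, select = English:Math)
no_bday <- subset(demo_data, select = -bday)

# Filter rows and select columns in one call
freshmen_grades <- subset(demo_data, subset = Rank == 1,
                          select = c(id, English, Math))
print(freshmen_grades)
```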

Working in R, part 2 | Summary statistics

Summary statistics for continuous numeric variables

  • mean(), sd(), min(), max(), median(), sum()
    • The na.rm=TRUE argument
  • Base R graphics: hist(), boxplot()
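
A short sketch of why the na.rm=TRUE argument matters. The vector sleep_hours is hypothetical and contains one missing value:

```r
# Hypothetical sleep times with one missing value
sleep_hours <- c(8, 6.5, NA, 7, 9)

# By default, a missing value makes the result missing
mean_with_na <- mean(sleep_hours)
print(mean_with_na)

# na.rm = TRUE drops missing values before computing
mean_no_na <- mean(sleep_hours, na.rm = TRUE)
print(mean_no_na)

# The same argument works for the other summary functions
sd(sleep_hours, na.rm = TRUE)
median(sleep_hours, na.rm = TRUE)
range_vals <- c(min(sleep_hours, na.rm = TRUE), max(sleep_hours, na.rm = TRUE))
print(range_vals)
```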

Summary statistics for categorical variables

  • table()
  • addmargins()
  • prop.table()
  • Base R graphics: barplot(table(...))
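
The categorical summary functions are usually chained together, as in this sketch. The rank vector below is hypothetical. Its labels follow the Fr/So/Jr/Sr coding used earlier:

```r
# Hypothetical class ranks stored as a factor
rank <- factor(c("Fr", "So", "Fr", "Jr", "Sr", "Fr", "So"),
               levels = c("Fr", "So", "Jr", "Sr"))

# Frequency counts for each category
counts <- table(rank)
print(counts)

# Add a "Sum" entry holding the total count
counts_with_total <- addmargins(counts)
print(counts_with_total)

# Convert counts to proportions
props <- prop.table(counts)
print(props)
```

Passing counts to barplot() would draw the same frequencies as a bar chart.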

Working in R, part 3 | Data analysis

Examples

  • Correlation using cor()
  • Linear regression using lm()
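
A minimal sketch of both functions on toy data. The variable names match the regression shown at the start of the workshop, but the values are invented, so the results do not reproduce that table:

```r
# Hypothetical toy data: hours of study and hours of sleep
demo_data <- data.frame(StudyTime = c(1, 2, 3, 4, 5, 6),
                        SleepTime = c(9, 8.5, 8, 7, 7.5, 6))

# Pearson correlation between two numeric vectors
r <- cor(demo_data$StudyTime, demo_data$SleepTime)
print(r)

# Linear regression: outcome on the left of ~, predictor(s) on the right
fit <- lm(SleepTime ~ StudyTime, data = demo_data)

# Coefficient estimates, standard errors, and p-values
summary(fit)

# Extract just the estimated coefficients
coef(fit)
```

cor() takes vectors (extracted with $), while lm() takes a formula plus a data frame. This is the vector-versus-dataframe distinction from Step 3.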

Questions

Appendix A | R version used for this workshop

R version used for this workshop

devtools::session_info()[[1]]
 setting  value
 version  R version 4.3.3 (2024-02-29 ucrt)
 os       Windows 10 x64 (build 19045)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/New_York
 date     2024-04-11
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

R package versions used for this workshop

devtools::package_info(c("haven"), dependencies=FALSE)
 field          value
 package        haven
 ondiskversion  2.5.4
 loadedversion  2.5.4
 path           C:/Users/kyeager4/AppData/Local/R/win-library/4.3/haven
 loadedpath     C:/Users/kyeager4/AppData/Local/R/win-library/4.3/haven
 attached       TRUE
 is_base        FALSE
 date           2023-11-30
 source         CRAN (R 4.3.2)
 md5ok          TRUE
 library        C:/Users/kyeager4/AppData/Local/R/win-library/4.3