Intro to R for Data Analysis

Oct 2025

In this workshop

  • What is R, and why should I use it? / Is R right for me?
  • Key terms
  • Mini worked example of a data analysis project in R
    • Importing data
    • Data cleaning/data management
    • Data analysis

Motivation

What we have: Data

Suppose we have data we want to “work with”, and our data is arranged like the following:

  • First row contains variable names
  • Subsequent rows contain observations (1 row = 1 respondent or unit)
  • Each column represents a variable (1 column = 1 distinct variable)
  • Common file formats for this type of data: CSV, TXT, Excel, SPSS .sav, SAS .sas7bdat, Stata .dta, …

Screenshot of structured data with first row containing variable names and subsequent rows containing data values


What can we do with this data?

What we want: Statistics & Graphs

Statistics

  SleepTime
  Predictors         Estimates   std. Error   p
  (Intercept)         9.73       0.25         <0.001
  StudyTime          -0.30       0.04         <0.001
  Rank [So]          -0.74       0.27          0.007
  Rank [Jr]          -0.47       0.32          0.149
  Rank [Sr]           0.34       0.52          0.519
  Observations        359
  R2 / R2 adjusted    0.263 / 0.255

What we really want: Communication

What we really want is to be able to understand our data and communicate our findings to others (in whatever format that may take):

Screenshot of an interactive data dashboard.

Screenshot of a peer-reviewed publication.

What we really want: Reproducibility

And we want to do these things in a way that’s consistent, transparent, and repeatable:

library(haven)
library(sjPlot)

#' Read data from SPSS and apply labels
df <- haven::read_spss("./data/Sample_Dataset_2019.sav")
df$Rank <- factor(df$Rank, 
                  levels=1:4, 
                  labels=c("Fr", "So", "Jr", "Sr"))

#' Fit linear regression model
mylm <- lm(SleepTime ~ StudyTime + Rank, data=df)

#' Generate nice table with parameter estimates and p-values
sjPlot::tab_model(mylm)

R and RStudio can help you achieve all of these things

About R

R software logo

What it is

A programming language designed for statistical data analysis

An open source statistical software program

A community of data scientists and practitioners

Operating systems

Windows, Mac, Linux

Cost

Free!

What makes R special: R packages

R can do a lot, but there are new analytic methods coming out all the time. That’s where R packages, and the power of the R community, come in.

  • R packages are user-created modules of code intended to fulfill a narrow task or purpose
    • e.g. Reading Excel, SAS, Stata, or SPSS data files into R
    • e.g. Fitting structural equation models
    • e.g. Creating special types of graphs, like choropleth maps
  • Packages are shared through repositories, which are like moderated archives for sharing of code

How to use R

Option 1: Run R/RStudio Locally

R software logo

To install R:

cran.r-project.org

RStudio software logo

To install RStudio:

rstudio.com

Option 2: Run R/RStudio in the Cloud

Posit Cloud logo

Create account:

posit.cloud

The R user interface

Screenshot of R user interface. Left side is the R console window, while the right side is the R Editor window for writing an R script.
  • The console is the “command line” interface part of R – type code into it, press enter, get results back
  • The console is fine for “playing around”, but most of the time, you’ll put everything into an R script: a file containing executable R code.

The RStudio User Interface

Screenshot of RStudio user interface. It is divided into four quadrants: R script editor (upper left), environment and history (upper right), R console (lower left), and Help panel (lower right).
  • Using RStudio is completely optional, but it has many helpful features to make writing and managing R code easier

Running Code in R or RStudio


Console

  1. Type or copy/paste code into the console
  2. Press the Enter key on your keyboard

Code in an R script

  1. Highlight the code you want to run (or put your cursor on the line you want to run)
  2. Click the “Run” button or press Ctrl + Enter on your keyboard (for Mac: press Command + Enter)

R Syntax

Unlike some other statistical software, R requires us to name both our dataset objects (called dataframes) and the variables inside those datasets (which are stored as vectors).

professions <- data.frame(job=c("Engineer", "Doctor"), 
                          salary=c(80000, 300000))

Here, we have a dataframe called professions, which contains two variables: job and salary. If we want to access the individual variables within the dataframe, we give the dataframe name followed by the $ operator, then the variable name:

professions$job
[1] "Engineer" "Doctor"  

R Syntax Behavior

For most actions, we will use functions: named “actions” that take some input(s) and return some output(s).

mean(professions$salary)
[1] 190000
toupper(professions$job)
[1] "ENGINEER" "DOCTOR"  

Some functions have more than one argument. Arguments are inputs to the function that change its behavior.

log(professions$salary, base=10)
[1] 4.903090 5.477121
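As a small illustration of how arguments work, the sketch below calls log() three ways: matching arguments by position, matching them by name, and leaving out an argument that has a default value. The salary values are made up and stand in for professions$salary.

```r
# Made-up values used only for illustration
salary <- c(80000, 300000)

# Arguments can be matched by position (first x, then base) ...
by_position <- log(salary, 10)

# ... or by name, which is easier to read and allows any order
by_name <- log(base = 10, x = salary)

# Arguments with default values can be left out entirely;
# log() uses the natural logarithm (base e) when base is not given
natural_log <- log(salary)

# Both ways of supplying base = 10 give the same result
all.equal(by_position, by_name)
```

Naming arguments is optional, but it tends to make scripts easier to read when a function takes several inputs.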

Worked Example

Today’s sample data

Tutorial sample data:

Research questions we’ll consider today:

  • Do students who study more get less sleep?
  • Do underclassmen (freshmen and sophomores) get less sleep than upperclassmen (juniors and seniors)?

The R Data Analysis Workflow

  1. Read/import data into R
    • Determine if special R packages needed to read data
  2. Data management
    • Check data quality
    • Compute new variables
    • Filter rows
    • Select columns
  3. Data analysis
    • Descriptive statistics
    • Plots
    • Statistical tests and models
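The workflow above can be sketched end to end in base R. This is a minimal illustration under stated assumptions, not the workshop's actual analysis: the data values are made up, the ID column is hypothetical, and plots are left out so the sketch runs without a graphics window.

```r
# Made-up data standing in for the workshop file
csv_text <- "ID,StudyTime,SleepTime,Rank
1,2,9,1
2,6,7,2
3,10,6,3
4,4,8,4
5,8,NA,1"

# 1. Read/import: read.csv() can parse text directly via its 'text' argument
mydata <- read.csv(text = csv_text)

# 2. Data management
summary(mydata)                                     # check data quality
mydata$SleepTime_minutes <- mydata$SleepTime * 60   # compute a new variable
complete <- subset(mydata, !is.na(SleepTime))       # filter rows
complete <- subset(complete, select = -ID)          # select columns

# 3. Data analysis
mean(complete$SleepTime)                            # descriptive statistic
cor(complete$StudyTime, complete$SleepTime)         # association between two variables
```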

Working in R, part 1 | Importing text-based data

Step 1: Get data into R

  • Data frames are R’s object type for traditional “data sets”
  • Base R function to import text and CSV files as a data frame: read.table or read.csv
mydata <- read.csv(file="C:/Users/yourname/yourfolder/Sample_Dataset_2019.csv")

This code creates an object named mydata containing the data from this file as a data frame.
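To try read.csv() without the workshop file, the sketch below writes a small made-up CSV to a temporary file and reads it back. The column values are invented, and the folder names passed to file.path() are placeholders.

```r
# Write a small made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("ID,SleepTime,StudyTime",
             "1,7.5,4",
             "2,6,10",
             "3,,2"),      # the empty SleepTime field is imported as NA
           csv_path)

# Import the file; the first row supplies the variable names (header = TRUE by default)
mydata <- read.csv(file = csv_path)

class(mydata)   # "data.frame"
dim(mydata)     # 3 rows, 3 columns

# file.path() joins folder names with forward slashes, which avoids
# backslash-escaping problems in Windows paths (placeholder folders)
example_path <- file.path("C:", "Users", "yourname", "yourfolder", "Sample_Dataset_2019.csv")
```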

Step 1: Get data into R

What if the data isn’t in CSV or plaintext format?

  • Oftentimes, it is necessary to find an R package that can read other data formats. Some packages I would recommend are:
    • Excel: readxl
    • SAS, SPSS, and Stata 13+: haven
  • To use an R package, we must first install it from a host repository
    • This is done by running the function install.packages(...) in the console (only need to do this once)
install.packages("haven")
  • Once a package has been installed, it must be loaded using the library(...) function at the start of your R session
library("haven")
mydata <- haven::read_spss(file="C:/Users/yourname/yourfolder/Sample_Dataset_2019.sav")
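One common pattern, sketched here as an option rather than a requirement, is to check whether a package is already installed before trying to use it. The sketch uses requireNamespace() from base R. If haven is available, it writes a made-up dataset to a temporary .sav file and reads it back. The install line is commented out so the code never downloads anything on its own.

```r
# Check whether haven is installed (installation is needed only once per computer)
haven_available <- requireNamespace("haven", quietly = TRUE)

if (!haven_available) {
  # install.packages("haven")
  message("haven is not installed; run install.packages(\"haven\") once.")
}

if (haven_available) {
  # Made-up data written to a temporary SPSS file, then read back in
  toy <- data.frame(SleepTime = c(7.5, 6, 8), Rank = c(1, 2, 4))
  sav_path <- tempfile(fileext = ".sav")
  haven::write_sav(toy, sav_path)
  roundtrip <- haven::read_spss(file = sav_path)
  print(names(roundtrip))
}
```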

Step 2: Look at the data we imported into R

How do we know if our import was successful?

The following functions take a data frame as an argument, and return previews or summaries about that data frame:

  • View the dataframe as a spreadsheet using View(mydata)
  • Print variable types using function str(mydata) (str is short for structure)
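  • Print minimum, maximum, mean, median, quartiles, and number of missing values for numeric variables in dataframe using summary(mydata)
  • Print first 6 rows of dataframe using head(mydata) (change the number of rows with the n argument, e.g. head(mydata, n=5))
  • Print last 6 rows of dataframe using tail(mydata)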
  • Print names of variables in dataframe using names(mydata)
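The inspection functions above can be tried on any data frame. Here is a short sketch on a made-up data frame (the values are invented); View() is left out because it opens an interactive window, and dim() is added as one more base R check of the row and column counts.

```r
# Made-up data frame used only for illustration
toy <- data.frame(
  ID        = 1:8,
  SleepTime = c(7.5, 6, 8, NA, 9, 5.5, 7, 6.5),
  Rank      = c(1, 2, 4, 3, 1, 2, 3, 4)
)

str(toy)            # variable types and the first few values of each variable
summary(toy)        # per-variable summaries, including a count of NAs
names(toy)          # variable names
dim(toy)            # number of rows, then number of columns
head(toy, n = 3)    # first 3 rows; 'n' controls how many rows are shown
tail(toy, n = 3)    # last 3 rows
```

If the row count, variable names, or variable types look wrong here, the import options usually need adjusting before any analysis.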

Step 3: Access variables within our dataset

  • Some functions expect a dataframe; other functions expect a vector of observations – that is, a single, specific column from a dataframe
  • Each variable is simply a named vector inside our dataframe

When we need to access a column of a dataframe as a vector, we use the $ operator to extract it:

mydata$SleepTime

This syntax can be combined with the assignment operator to add new variables to a dataframe, or edit existing variables in a dataframe:

mydata$SleepTime_minutes <- mydata$SleepTime*60
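To make the dataframe-versus-vector distinction concrete, here is a minimal sketch on a made-up data frame: nrow() takes the whole data frame, while mean() needs a single column extracted with $.

```r
# Made-up data frame used only for illustration
toy <- data.frame(SleepTime = c(7.5, 6, 8), StudyTime = c(4, 10, 2))

# nrow() expects a data frame
nrow(toy)

# mean() expects a vector, so extract the column first
mean(toy$SleepTime)

# The extracted column is an ordinary numeric vector, not a data frame
is.numeric(toy$SleepTime)
is.data.frame(toy$SleepTime)

# Assigning to a column name that does not exist yet adds a new column
toy$SleepTime_minutes <- toy$SleepTime * 60
names(toy)
```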

Step 4: Subsetting rows or columns of our dataset

Normally when we want to access specific rows of a dataframe, what we really want is to filter our data by some condition. We can do this using function subset():

freshmen_only <- subset(mydata, subset = Rank==1)

The first argument to subset is the name of the dataframe. The second argument is a logical condition for which rows to keep (here, Rank==1, i.e. keep only freshmen).

If we instead want to drop or keep specific columns of our dataset, we can also use the subset function, but use its select argument:

grade_data <- subset(mydata, select = English:Science)
fitness_data <- subset(mydata, select = c(Athlete, Smoking, Mile))
data_dropbday <- subset(mydata, select = -bday)
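The same calls can be tried on a small made-up data frame. In this sketch the column names mirror the ones used above, but the Math column, the ID column, and all values are invented for illustration. The last line shows that the subset and select arguments can be combined in one call.

```r
# Made-up data frame used only for illustration
toy <- data.frame(
  ID      = 1:5,
  Rank    = c(1, 2, 1, 4, 3),
  English = c(80, 75, 90, 60, 85),
  Math    = c(70, 65, 95, 55, 80),
  Science = c(88, 72, 91, 58, 79),
  bday    = c("01/05", "03/12", "07/30", "11/02", "09/18")
)

freshmen_only <- subset(toy, subset = Rank == 1)         # keep rows where Rank is 1
grade_data    <- subset(toy, select = English:Science)   # keep a run of adjacent columns
no_bday       <- subset(toy, select = -bday)             # keep everything except bday

# Filter rows and select columns in a single call
fresh_grades  <- subset(toy, Rank == 1, select = c(ID, English))
```

Note that English:Science selects by column position, so it picks up every column that sits between those two in the data frame.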

Working in R, part 2 | Summary statistics

Summary statistics for continuous numeric variables

  • mean(), sd(), min(), max(), median(), sum()
    • The na.rm=TRUE argument
  • Base R graphics: hist(), boxplot()
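The na.rm=TRUE argument matters whenever a variable has missing values. A minimal sketch with made-up sleep times: without na.rm, a single NA makes the result NA. The hist() call uses plot = FALSE so that it returns the bin counts without opening a graphics window.

```r
# Made-up values with one missing observation
sleep_hours <- c(7.5, 6, NA, 8, 6.5)

mean(sleep_hours)                  # NA, because one value is missing
mean(sleep_hours, na.rm = TRUE)    # 7: the missing value is dropped before averaging
sd(sleep_hours, na.rm = TRUE)
median(sleep_hours, na.rm = TRUE)
sum(is.na(sleep_hours))            # how many values are missing

# hist() with plot = FALSE returns the histogram bins instead of drawing them
sleep_hist <- hist(sleep_hours, plot = FALSE)
sleep_hist$counts
```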

Summary statistics for categorical variables

  • table()
  • addmargins()
  • prop.table()
  • Base R graphics: barplot(table(...))
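These functions build on each other: table() counts, and the other two take that table as input. A short sketch with made-up class ranks, labeled the same way as in the reproducibility example:

```r
# Made-up class ranks, coded 1-4 and labeled as a factor
rank <- factor(c(1, 2, 1, 4, 3, 1, 2),
               levels = 1:4,
               labels = c("Fr", "So", "Jr", "Sr"))

rank_counts <- table(rank)               # frequency of each category
addmargins(rank_counts)                  # same counts plus a "Sum" total
rank_props <- prop.table(rank_counts)    # counts converted to proportions
round(rank_props, 2)
```

In an interactive session, barplot(rank_counts) would draw the same counts as bars.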

Working in R, part 3 | Data analysis

Examples

  • Correlation using cor()
  • Linear regression using lm()
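As a sketch of what these two calls look like, the code below simulates data in which sleep decreases as study time increases, then computes the correlation and fits a simple linear regression. The data are simulated for illustration only and are not the workshop dataset; the slope and noise level are arbitrary choices.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated data: sleep decreases with study time, plus random noise
n <- 100
study <- runif(n, min = 0, max = 12)
sleep <- 9 - 0.3 * study + rnorm(n, sd = 1)
toy <- data.frame(StudyTime = study, SleepTime = sleep)

# Pearson correlation between two numeric vectors;
# use = "complete.obs" drops rows with missing values
r <- cor(toy$StudyTime, toy$SleepTime, use = "complete.obs")
r

# Linear regression: outcome ~ predictor(s), with the data frame passed via 'data'
fit <- lm(SleepTime ~ StudyTime, data = toy)
coef(fit)       # intercept and slope
summary(fit)    # estimates, standard errors, p-values, R-squared
```

Additional predictors are added to the formula with +, as in SleepTime ~ StudyTime + Rank.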

Questions

Appendix A | R version used for this workshop

R version used for this workshop

devtools::session_info()[[1]]
 setting  value
 version  R version 4.5.1 (2025-06-13 ucrt)
 os       Windows 11 x64 (build 26100)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/New_York
 date     2025-10-28
 pandoc   3.6.3 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
 quarto   NA @ C:\\Users\\kyeager4\\AppData\\Local\\Programs\\Quarto\\bin\\quarto.exe

R package versions used for this workshop

devtools::package_info(c("haven"), dependencies=FALSE)
 package        haven
 ondiskversion  2.5.5
 loadedversion  2.5.5
 path           C:/Users/kyeager4/AppData/Local/R/win-library/4.5/haven
 loadedpath     C:/Users/kyeager4/AppData/Local/R/win-library/4.5/haven
 attached       TRUE
 is_base        FALSE
 date           2025-05-30
 source         CRAN (R 4.5.1)
 md5ok          TRUE
 library        C:/Users/kyeager4/AppData/Local/R/win-library/4.5