Intro to R for Data Analysis

Kristin Yeager

Kent State University Libraries

Oct 2025

In this workshop

What is R, and why should I use it? / Is R right for me?
Key terms
Mini worked example of a data analysis project in R
- Importing data
- Data cleaning/data management
- Data analysis

Motivation

What we have: Data

Suppose we have data we want to “work with”, and our data is arranged like the following:

First row contains variable names
Subsequent rows contain observations (1 row = 1 respondent or unit)
Each column represents a variable (1 column = 1 distinct variable)
Common file formats for this type of data: CSV, TXT, Excel, SPSS .sav, SAS .sas7bdat, Stata *.dta, …

Screenshot of structured data with first row containing variable names and subsequent rows containing data values

What can we do with this data?

What we want: Statistics & Graphs

Statistics

	SleepTime
Predictors	Estimates	std. Error	p
(Intercept)	9.73	0.25	<0.001
StudyTime	-0.30	0.04	<0.001
Rank [So]	-0.74	0.27	0.007
Rank [Jr]	-0.47	0.32	0.149
Rank [Sr]	0.34	0.52	0.519
Observations	359
R² / R² adjusted	0.263 / 0.255

What we really want: Communication

What we really want is to be able to understand our data and communicate our findings to others (in whatever format that may take):

Screenshot of an interactive data dashboard.

Screenshot of a peer-reviewed publication.

What we really want: Reproducibility

And we want to do these things in a way that’s consistent, transparent, and repeatable:

library(haven)
library(sjPlot)

#' Read data from SPSS and apply labels
df <- haven::read_spss("./data/Sample_Dataset_2019.sav")
df$Rank <- factor(df$Rank, 
                  levels=1:4, 
                  labels=c("Fr", "So", "Jr", "Sr"))

#' Fit linear regression model
mylm <- lm(SleepTime ~ StudyTime + Rank, data=df)

#' Generate nice table with parameter estimates and p-values
sjPlot::tab_model(mylm)

R and RStudio can help you achieve all of these things

About R

What it is: A programming language designed for statistical data analysis

An open source statistical software program

A community of data scientists and practitioners
Operating systems: Windows, Mac, Linux
Cost: Free!

What makes R special: R packages

R can do a lot, but new analytic methods come out all the time. That’s where R packages, and the power of the R community, come in.

R packages are user-created modules of code intended to fulfill a narrow task or purpose
- e.g. Reading Excel, SAS, Stata, or SPSS data files into R
- e.g. Fitting structural equation models
- e.g. Creating special types of graphs, like choropleth maps
Packages are shared through repositories, which are like moderated archives for sharing of code
- CRAN is the “main” repo. See CRAN Task Views to explore R packages by topic
- See also: Bioconductor

How to use R

Option 1: Run R/RStudio Locally

To install R:

cran.r-project.org

To install RStudio:

rstudio.com

Option 2: Run R/RStudio in the Cloud

Create account:

posit.cloud

The R user interface

Screenshot of R user interface. Left side is the R console window, while the right side is the R Editor window for writing an R script.

The console is the “command line” interface part of R – type code into it, press enter, get results back
The console is fine for “playing around”, but most of the time, you’ll put everything into an R script: a file containing executable R code.

The RStudio User Interface

Screenshot of RStudio user interface. It is divided into four quadrants: R script editor (upper left), environment and history (upper right), R console (lower left), and Help panel (lower right).

Using RStudio is completely optional, but it has many helpful features to make writing and managing R code easier

Running Code in R or RStudio

Console

Type or copy/paste code into the console
Press the Enter key on your keyboard

Code in an R script

Highlight the code you want to run (or put your cursor on the line you want to run)
Click the “Run” button or press Ctrl + Enter on your keyboard (for Mac: press Command + Enter)

R Syntax

Unlike some other statistical software, R requires us to name both our dataset objects (called dataframes) and the variables inside those datasets (called vectors).

professions <- data.frame(job=c("Engineer", "Doctor"),
                          salary=c(80000, 300000))

Here, we have a dataframe called professions, which contains two variables: job and salary. If we want to access the individual variables within the dataframe, we give the dataframe name followed by the $ operator, then the variable name:

professions$job

[1] "Engineer" "Doctor"

R Syntax Behavior

For most actions, we will use functions: named “actions” that take some input(s) and return some output(s).

mean(professions$salary)

[1] 190000

toupper(professions$job)

[1] "ENGINEER" "DOCTOR"

Some functions have more than one argument. Arguments are inputs to the function that change its behavior.

log(professions$salary, base=10)

[1] 4.903090 5.477121
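
As a small illustration of how arguments work, the sketch below passes arguments by position and by name. The numbers are made up for this example; round() and seq() are base R functions.

```r
# Made-up values for illustration
x <- c(2.718, 3.14159)

# round() has a second argument, digits, which defaults to 0
rounded_default <- round(x)
rounded_two     <- round(x, digits = 2)

# Arguments given by name can appear in any order
seq_a <- seq(from = 1, to = 10, by = 3)
seq_b <- seq(by = 3, to = 10, from = 1)

print(rounded_default)
print(rounded_two)
print(identical(seq_a, seq_b))
```

Naming arguments makes code easier to read, and it protects against mistakes when a function has many optional arguments.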

Worked Example

Today’s sample data

Tutorial sample data:

Simulated survey dataset we use for our online tutorials
n=435 “college students”
Available in CSV, TXT, SAS, SPSS formats
Download from our website: libguides.library.kent.edu/SPSS/ (direct download link for CSV, direct download link for SPSS format)

Research questions we’ll consider today:

Do students who study more get less sleep?
Do underclassmen (freshmen and sophomores) get less sleep than upperclassmen (juniors and seniors)?

The R Data Analysis Workflow

Read/import data into R

Determine whether special R packages are needed to read the data

Data management

Check data quality
Compute new variables
Filter rows
Select columns

Data analysis

Descriptive statistics
Plots
Statistical tests and models

Working in R, part 1 | Importing text-based data

Step 1: Get data into R

Data frames are R’s object type for traditional “data sets”
Base R functions to import text and CSV files as a data frame: read.table() or read.csv()

mydata <- read.csv(file="C:/Users/yourname/yourfolder/Sample_Dataset_2019.csv")

This code creates an object named mydata containing the data from this file as a data frame.
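
If you do not have the sample file on hand, the following sketch shows the same import step on a tiny, made-up dataset. It writes a CSV to a temporary file and reads it back; the values and the name toy are assumptions for illustration, not the workshop data.

```r
# Build a small, made-up data frame and write it to a temporary CSV file
toy <- data.frame(ids = 1:3,
                  SleepTime = c(7, 6.5, 8),
                  StudyTime = c(4, 10, 2))
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, file = csv_path, row.names = FALSE)

# Read the file back into R as a data frame
mydata <- read.csv(file = csv_path)

print(class(mydata))  # the type of object that read.csv() returns
print(dim(mydata))    # number of rows, then number of columns
```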

Step 1: Get data into R

What if the data isn’t in CSV or plaintext format?

Oftentimes, it is necessary to find an R package that can read other data formats. Some packages I would recommend are:
- Excel: readxl
- SAS, SPSS, and Stata 13+: haven
To use an R package, we must first install it from a host repository
- This is done by running the function install.packages(...) in the console (only need to do this once)

install.packages("haven")

Once a package has been installed, it must be loaded using the library(...) function at the start of each R session

library("haven")
mydata <- haven::read_spss(file="C:/Users/yourname/yourfolder/Sample_Dataset_2019.sav")
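
One way to keep the "install once, load every session" distinction straight is to check whether a package is already installed before loading it. The sketch below is only an illustration; the helper name is_installed is hypothetical and not part of R or haven.

```r
# Hypothetical helper for this example: TRUE if the package is installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (is_installed("haven")) {
  library(haven)  # load the package for the current session
  message("haven is installed and loaded")
} else {
  message('haven is not installed; run install.packages("haven") once in the console')
}
```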

Step 2: Look at the data we imported into R

How do we know if our import was successful?

The following functions take a data frame as an argument, and return previews or summaries about that data frame:

View the dataframe as a spreadsheet using View(mydata)
Print variable types using str(mydata) (str is short for structure)
Print minimum, maximum, quartiles, mean, and number of missing values for each numeric variable in dataframe using summary(mydata)
Print first 6 rows of dataframe using head(mydata)
Print last 6 rows of dataframe using tail(mydata)
Print names of variables in dataframe using names(mydata)
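
The sketch below runs these inspection functions on a small, made-up data frame standing in for mydata (the values are assumptions for illustration). View() is left out because it opens a separate viewer window.

```r
# Made-up data frame with 10 rows, standing in for mydata
toy <- data.frame(ids = 1:10,
                  Rank = rep(1:4, length.out = 10),
                  SleepTime = c(7, 6.5, NA, 8, 5, 7.5, 6, 9, 7, 6.5))

str(toy)             # variable types and a preview of the values
print(summary(toy))  # per-variable summaries, including a count of NA values
print(names(toy))    # variable names

print(head(toy))         # first 6 rows (the default)
print(head(toy, n = 3))  # first 3 rows
print(tail(toy, n = 3))  # last 3 rows
```

The n argument of head() and tail() controls how many rows are printed.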

Step 3: Access variables within our dataset

Some functions expect a dataframe; other functions expect a vector of observations – that is, a single, specific column from a dataframe
Each variable is simply a named vector inside our dataframe

When we need to access a column of a dataframe as a vector, we use the $ operator to extract it:

mydata$SleepTime

This syntax can be combined with the assignment operator to add new variables to a dataframe, or edit existing variables in a dataframe:

mydata$SleepTime_minutes <- mydata$SleepTime*60
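
The same pattern works for recoding. As a sketch, the example below builds an underclassman/upperclassman grouping from Rank, which is one possible setup for the second research question. The data are made up, the variable name Underclassman is hypothetical, and the code assumes Rank is coded 1 = Fr, 2 = So, 3 = Jr, 4 = Sr.

```r
# Made-up data; Rank assumed to be coded 1 = Fr, 2 = So, 3 = Jr, 4 = Sr
toy <- data.frame(Rank = c(1, 2, 3, 4, NA),
                  SleepTime = c(7, 6.5, 8, 6, 7))

# New variable: sleep time converted from hours to minutes
toy$SleepTime_minutes <- toy$SleepTime * 60

# New variable (hypothetical name): group by class rank; a missing Rank stays NA
toy$Underclassman <- ifelse(toy$Rank <= 2, "Underclassman", "Upperclassman")

print(toy)
```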

Step 4: Subsetting rows or columns of our dataset

Normally when we want to access specific rows of a dataframe, what we really want is to filter our data by some condition. We can do this using the subset() function:

freshmen_only <- subset(mydata, subset = Rank==1)

The first argument to subset is the name of the dataframe. The second argument is a logical condition for which rows to keep (here, Rank==1, i.e. keep only freshmen).

If we instead want to drop or keep specific columns of our dataset, we can also use the subset function, but use its select argument:

grade_data <- subset(mydata, select = English:Science)
fitness_data <- subset(mydata, select = c(Athlete, Smoking, Mile))
data_dropbday <- subset(mydata, select = -bday)
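
Row and column subsetting can also be combined. The following sketch uses a made-up data frame (not the workshop data) to show conditions joined with &, how missing values are handled, and the subset and select arguments used together.

```r
# Made-up data frame for illustration
toy <- data.frame(ids = 1:6,
                  Rank = c(1, 1, 2, 3, 4, NA),
                  SleepTime = c(7, 6.5, 8, 6, 7, 5),
                  StudyTime = c(4, 10, 2, 12, 6, 8))

# Rows: freshmen who sleep at least 7 hours (two conditions joined with &)
rested_freshmen <- subset(toy, subset = Rank == 1 & SleepTime >= 7)

# Rows where the condition evaluates to NA (here, the missing Rank) are dropped
upperclassmen <- subset(toy, subset = Rank >= 3)

# Rows and columns in a single call
sleep_by_rank <- subset(toy, subset = !is.na(Rank), select = c(Rank, SleepTime))

print(rested_freshmen)
print(upperclassmen)
print(sleep_by_rank)
```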

Working in R, part 2 | Summary statistics

Summary statistics for continuous numeric variables

mean(), sd(), min(), max(), median(), sum()
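- The na.rm=TRUE argument tells these functions to ignore missing values (NA); without it, a single NA in the input makes the result NA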
Base R graphics: hist(), boxplot()
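
A minimal sketch of these functions on a made-up vector with one missing value (the numbers are assumptions for illustration):

```r
# Made-up sleep times in hours, with one missing value
sleep <- c(7, 6.5, NA, 8, 5, 7.5, 6, 9)

print(mean(sleep))                 # NA, because one value is missing
print(mean(sleep, na.rm = TRUE))   # mean of the non-missing values
print(sd(sleep, na.rm = TRUE))
print(median(sleep, na.rm = TRUE))
print(min(sleep, na.rm = TRUE))
print(max(sleep, na.rm = TRUE))

# Base R graphics
hist(sleep, main = "Sleep time", xlab = "Hours")
boxplot(sleep, ylab = "Hours")
```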

Summary statistics for categorical variables

table()
addmargins()
prop.table()
Base R graphics: barplot(table(...))
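
A minimal sketch of these functions on a made-up factor (the values and the name class_rank are assumptions for illustration):

```r
# Made-up class ranks, with one missing value
class_rank <- factor(c("Fr", "Fr", "So", "Jr", "Sr", "So", "Fr", NA),
                     levels = c("Fr", "So", "Jr", "Sr"))

counts <- table(class_rank)               # frequencies; NA is excluded by default
print(counts)
print(addmargins(counts))                 # adds a "Sum" total
print(prop.table(counts))                 # proportions that add up to 1
print(table(class_rank, useNA = "ifany")) # also count missing values

# Base R graphics
barplot(counts, xlab = "Class rank", ylab = "Count")
```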

Working in R, part 3 | Data analysis

Examples

Correlation using cor()
Linear regression using lm()
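
As a sketch of how these two functions might be applied to the first research question, the example below uses simulated data. The simulation parameters are arbitrary assumptions chosen for illustration; they are not estimates from the workshop dataset.

```r
# Simulated data for illustration only (arbitrary parameters)
set.seed(42)
n <- 100
StudyTime <- runif(n, min = 0, max = 20)
SleepTime <- 9 - 0.3 * StudyTime + rnorm(n, sd = 1)
toy <- data.frame(StudyTime, SleepTime)
toy$SleepTime[c(5, 17)] <- NA  # add a couple of missing values

# Pearson correlation, using only rows where both variables are observed
r <- cor(toy$StudyTime, toy$SleepTime, use = "complete.obs")
print(r)

# Simple linear regression; lm() drops rows with missing values by default
fit <- lm(SleepTime ~ StudyTime, data = toy)
print(summary(fit))  # estimates, standard errors, p-values, R-squared
print(coef(fit))
```

The use = "complete.obs" argument plays the same role for cor() that na.rm = TRUE plays for mean(): without it, missing values make the result NA.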

Questions

Appendix A | R version used for this workshop

R version used for this workshop

devtools::session_info()[[1]]

 setting  value
 version  R version 4.5.1 (2025-06-13 ucrt)
 os       Windows 11 x64 (build 26100)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/New_York
 date     2025-10-28
 pandoc   3.6.3 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
 quarto   NA @ C:\\Users\\kyeager4\\AppData\\Local\\Programs\\Quarto\\bin\\quarto.exe

R package versions used for this workshop

devtools::package_info(c("haven"), dependencies=FALSE)

	package	ondiskversion	loadedversion	path	loadedpath	attached	is_base	date	source	md5ok	library
haven	haven	2.5.5	2.5.5	C:/Users/kyeager4/AppData/Local/R/win-library/4.5/haven	C:/Users/kyeager4/AppData/Local/R/win-library/4.5/haven	TRUE	FALSE	2025-05-30	CRAN (R 4.5.1)	TRUE	C:/Users/kyeager4/AppData/Local/R/win-library/4.5