Intro to R for Data Analysis

Kristin Yeager

Kent State University Libraries

Apr 2024

In this workshop

What is R, and why should I use it?
Key terms
Worked example of the R workflow
- Setting up RStudio project
- Importing data
- Data cleaning/data management
- Data analysis

Motivation

What we have: Data

Suppose we have data we want to “work with”, and our data is arranged like the following:

First row contains variable names
Subsequent rows contain observations (1 row = 1 respondent or unit)
Each column represents a variable (1 column = 1 distinct variable)
Common file formats for this type of data: CSV, TXT

Screenshot of structured data with first row containing variable names and subsequent rows containing data values

What can we do with this data?

What we want: Statistics

	SleepTime
Predictors	Estimates	CI	p
(Intercept)	9.73	9.24 – 10.23	<0.001
StudyTime	-0.30	-0.37 – -0.23	<0.001
Rank [So]	-0.74	-1.27 – -0.20	0.007
Rank [Jr]	-0.47	-1.10 – 0.17	0.149
Rank [Sr]	0.34	-0.69 – 1.36	0.519
Observations	359
R² / R² adjusted	0.263 / 0.255

What we want: Graphs

What we really want: Communication

What we really want is to be able to understand our data and communicate our findings to others (in whatever format that may take):

What we really want: Reproducibility

And we want to do these things in a way that’s consistent, transparent, and repeatable:

library(haven)
library(sjPlot)

#' Read data from SPSS and apply labels
df <- haven::read_spss("./data/Sample_Dataset_2019.sav")
df$Rank <- factor(df$Rank, 
                  levels=1:4, 
                  labels=c("Fr", "So", "Jr", "Sr"))

#' Fit linear regression model
mylm <- lm(SleepTime ~ StudyTime + Rank, data=df)

#' Generate nice table with parameter estimates and p-values
sjPlot::tab_model(mylm)

R and RStudio can help you achieve all of these things

About R

What it is: A programming language designed for statistical data analysis

An open source statistical software program

A community of data scientists and practitioners
Operating systems: Windows, Mac, Linux
Cost: Free!

How to use R

Option 1: Run R/RStudio Locally

To install R:

cran.r-project.org

To install RStudio:

rstudio.com

Option 2: Run R/RStudio in the Cloud

Create account:

posit.cloud

The R user interface

The console is the “command line” interface part of R – type code into it, press enter, get results back
The console is fine for “playing around”, but most of the time, you’ll put everything into an R script – a file containing executable R code

The RStudio User Interface

Using RStudio is completely optional, but it has many helpful features to make writing and managing R code easier

R syntax

Necessary vocabulary before we can get to the “good stuff”

Basic usage

Type commands into the R console, press “enter”, see results
R is a great basic calculator! Try doing some basic arithmetic like addition (+), subtraction (-), multiplication (*), division (/), powers/exponents (** or ^)

12+13

[1] 25

12*13

[1] 156

12/13

[1] 0.9230769

12**2

[1] 144

Objects

To really benefit from R, we can’t type and re-type every operation by hand: we need to be able to store and retrieve the results of calculations.
Objects are named “things” within R that we can access or manipulate.
Objects are created using the assignment operator: <-

object_name <- value_to_assign_to_object
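As a minimal sketch of this pattern, the following stores the result of a calculation and then reuses it. The object names total and doubled are made up for illustration.

```r
# Store the result of a calculation in an object named `total`
total <- 12 + 13

# Printing the object retrieves its stored value
print(total)

# The stored value can be reused in later calculations
doubled <- total * 2
print(doubled)
```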

Functions

Similarly, functions are named “actions” or formulas we can perform in R.
Functions usually “look like” a word followed by parentheses, with “stuff” in between the parentheses:

mean(...)

When you start using R, you’ll mostly be working with pre-existing functions from base R or R packages.
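For illustration, here are two base R functions being called. The "stuff" between the parentheses is called the function's arguments; the object names root and rounded are made up for this sketch.

```r
# sqrt() takes one argument: the number whose square root we want
root <- sqrt(16)
print(root)

# round() takes a number plus a named argument, `digits`,
# which sets how many decimal places to keep
rounded <- round(3.14159, digits = 2)
print(rounded)
```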

Types of Objects: Vectors

Vectors are a special type of object that represents a set of numbers (or other data values, like letters).
- Created using the c() function:

new_vector <- c(3, 1, 1, 0)
print(new_vector)

[1] 3 1 1 0

new_vector_char <- c("a", "b", "c", "d")
print(new_vector_char)

[1] "a" "b" "c" "d"

Types of Objects: Dataframes

Dataframes are R’s representation of the “grid-like” data structure seen earlier
- Each column within a dataframe is technically a vector (most of the time)
We can create small dataframes using the data.frame() function, but most of the time we’ll create them by importing an external data file

mydf <- data.frame(new_vector=c(3, 1, 1, 0),
                   new_vector_letters=c("a", "b", "c", "d"))
print(mydf)

  new_vector new_vector_letters
1          3                  a
2          1                  b
3          1                  c
4          0                  d
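A short sketch of how one might check the size of a dataframe and the type of each column, using the same small mydf as above. It assumes R 4.0 or later, where character columns stay as character rather than being converted to factors.

```r
# Re-create the small example dataframe so this block runs on its own
mydf <- data.frame(new_vector = c(3, 1, 1, 0),
                   new_vector_letters = c("a", "b", "c", "d"))

# Number of rows (observations) and columns (variables)
n_rows <- nrow(mydf)
n_cols <- ncol(mydf)
print(c(n_rows, n_cols))

# Each column is a vector with its own type
col_types <- sapply(mydf, class)
print(col_types)
```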

Putting it together: Doing arithmetic using objects

We can use functions and arithmetic operators on our named objects instead of having to re-write the original values every time we do a calculation:

new_object <- 5
print(new_object)

[1] 5

new_vector <- c(3, 1, 1, 0)
print(new_vector)

[1] 3 1 1 0

new_vector + new_object

[1] 8 6 6 5

Putting it together: Using functions on objects

Example: Compute the mean of new_vector:

#Reminder of the values of new_vector
print(new_vector)

[1] 3 1 1 0

# Calculate the mean of new_vector
mean(new_vector)

[1] 1.25

Looking up functions

You can open a function’s documentation page using the help() function:

help("mean")
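When the full documentation page is more than you need, a quick look at a function's arguments can be enough. The sketch below uses the base R function sd() as the example; sd_args is a made-up object name.

```r
# The full help page can be opened with help("sd") or the shorthand ?sd.
# For a quick look at just the arguments, args() prints the function signature:
args(sd)

# names(formals()) returns the argument names as a character vector
sd_args <- names(formals(sd))
print(sd_args)
```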

R packages

R can do a lot, but there are new analytic methods coming out all the time. That’s where R packages, and the power of the R community, come in.

R packages are user-created modules of code intended to fulfill a narrow task or purpose
- e.g. Reading Excel files into R
Packages are shared through repositories, which are like moderated archives for sharing of code
- CRAN is the “main” repo. See CRAN Task Views to explore R packages by topic
- See also: Bioconductor

Using R packages

To use an R package, we must first install it from a host repository
- This is done by running the function install.packages(...) in the console (you only need to do this once per R installation)
Once a package has been installed, it must be loaded using the library(...) function at the start of each R session in which you want to use it

install.packages("ggplot2")
library("ggplot2")
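One possible way to avoid reinstalling a package every time a script runs is to check whether it is already installed. This is only a sketch: has_ggplot2 is a made-up object name, and the block prints a reminder rather than installing anything.

```r
# requireNamespace() returns TRUE if the package is installed,
# without attaching it to the session
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (has_ggplot2) {
  # Already installed: load it for this session
  library("ggplot2")
} else {
  message("ggplot2 is not installed; run install.packages('ggplot2') once to install it")
}

# A function can also be called as package::function() without library();
# stats is a package that ships with R
middle_value <- stats::median(c(3, 1, 1, 0))
print(middle_value)
```

The package::function() form is the same one used in the haven::read_spss() and sjPlot::tab_model() calls earlier.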

Worked Example

Today’s sample data

Tutorial sample data:

Simulated survey dataset we use for our online tutorials
n=435 “college students”
Available in CSV, TXT, SAS, SPSS formats
Download from our website: libguides.library.kent.edu/SPSS/ (direct download link for CSV, direct download link for SPSS format)

Research questions we’ll consider today:

Do students who study more get less sleep?
Do underclassmen (freshmen and sophomores) get less sleep than upperclassmen (juniors and seniors)?

The R Data Analysis Workflow

Read/import data into R

Determine whether special R packages are needed to read the data

Data management

Check data quality
Compute new variables
Filter rows
Select columns

Data analysis

Descriptive statistics
Plots
Statistical tests and models

Working in R, part 1 | Importing text-based data

Step 1: Get data into R

Data frames are R’s object type for traditional “data sets”
Base R functions to import text and CSV files as a data frame: read.table() or read.csv()

mydata <- read.csv(file="C:/Users/yourname/yourfolder/Sample_Dataset_2019.csv")

This code creates an object named mydata containing the data from this file as a data frame.
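To try read.csv() without downloading anything, the sketch below writes a tiny made-up CSV file to a temporary location and reads it back. The values and the names example_rows and csv_path are invented for illustration and are not the workshop dataset.

```r
# Hypothetical stand-in data (not the real sample dataset)
example_rows <- data.frame(ids = c(1, 2, 3),
                           StudyTime = c(2.5, 4.0, NA),
                           SleepTime = c(8.0, 6.5, 7.0))

# Write it to a temporary CSV file so there is something to import
csv_path <- tempfile(fileext = ".csv")
write.csv(example_rows, file = csv_path, row.names = FALSE)

# Import the CSV file as a data frame
mydata <- read.csv(file = csv_path)
print(mydata)
```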

Step 1: Get data into R

What if the data isn’t in CSV or plaintext format?

Oftentimes, it is necessary to find an R package that can read other data formats. Some packages I would recommend are:
- Excel: readxl
- SAS, SPSS, and Stata 13+: haven

Example: Reading SPSS-format dataset into R using package haven (absolute file path)

library(haven)
mydata <- haven::read_spss(file="C:/Users/yourname/yourfolder/Sample_Dataset_2019.sav")
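If the sample file is not at hand, the round trip below may help to see what read_spss() returns. It is a sketch that assumes the haven package is installed (it is skipped otherwise); the data values and the names sav_path and spss_data are made up.

```r
if (requireNamespace("haven", quietly = TRUE)) {
  # Write a tiny made-up dataset to a temporary SPSS (.sav) file
  sav_path <- tempfile(fileext = ".sav")
  haven::write_sav(data.frame(Rank = c(1, 2, 4),
                              SleepTime = c(8, 6.5, 7)),
                   sav_path)

  # Read the SPSS file back into R
  spss_data <- haven::read_spss(file = sav_path)
  print(spss_data)
} else {
  message("Package 'haven' is not installed; skipping the SPSS example")
}
```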

Step 2: Look at the data we imported into R

How do we know if our import was successful?

The following functions take a data frame as an argument, and return previews or summaries about that data frame:

View the dataframe as a spreadsheet using View(mydata)
Print variable types using function str(mydata) (str is short for structure)
Print the minimum, quartiles, median, mean, maximum, and number of missing values for each numeric variable in the dataframe using summary(mydata)
Print first 6 rows of dataframe using head(mydata)
Print last 6 rows of dataframe using tail(mydata)
Print names of variables in dataframe using names(mydata)
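The preview functions above can be tried on a small made-up dataframe, as sketched here. The dataframe demo and its values are invented; View() is left out because it needs an interactive session such as RStudio.

```r
# Made-up dataframe with 10 rows and one missing value
demo <- data.frame(id = 1:10,
                   score = c(88, 92, NA, 75, 81, 95, 67, 90, 84, 79))

# Variable types and a preview of the values
str(demo)

# Summary statistics for each column, including the count of NA values
summary(demo)

# head() and tail() return 6 rows unless the n argument says otherwise
first_rows <- head(demo)
last_three <- tail(demo, n = 3)
print(first_rows)
print(last_three)

# Variable names
print(names(demo))
```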

Step 3: Access variables within our dataset

Some functions expect a dataframe; other functions expect a vector of observations – that is, a single, specific column from a dataframe
Each variable is simply a named vector inside our dataframe

When we need to access a column of a dataframe as a vector, we use the $ operator to extract it:

mydata$Mile

This syntax can be combined with the assignment operator to add new variables to a dataframe, or edit existing variables in a dataframe:

mydata$Mile_minutes <- mydata$Mile/60
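A self-contained sketch of the same two steps, using a made-up dataframe. It assumes, for illustration only, that Mile holds mile run times in seconds.

```r
# Made-up data; Mile is assumed here to be measured in seconds
demo <- data.frame(ids = c(1, 2, 3),
                   Mile = c(420, 515, 600))

# Extract one column as a vector
mile_times <- demo$Mile
print(mile_times)

# Add a new column computed from an existing one
demo$Mile_minutes <- demo$Mile / 60
print(demo)
```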

Step 4: Subsetting rows or columns of our dataset

Normally when we want to access specific rows of a dataframe, what we really want is to filter our data by some condition. We can do this using function subset():

freshmen_only <- subset(mydata, subset = Rank==1)

The first argument to subset is the name of the dataframe. The second argument is a logical condition for which rows to keep (here, Rank==1, i.e. keep only freshmen).

If we instead want to drop or keep specific columns of our dataset, we can also use the subset function, but use its select argument:

grade_data <- subset(mydata, select = English:Science)
fitness_data <- subset(mydata, select = c(Athlete, Smoking, Mile))
data_dropbday <- subset(mydata, select = -bday)
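The calls above can be tried on a small made-up dataframe, as in this sketch. The column names mirror the ones used above, but the values are invented.

```r
# Made-up data with the same column names as the examples above
demo <- data.frame(Rank    = c(1, 2, 1, 4),
                   English = c(80, 75, 90, 85),
                   Math    = c(70, 88, 92, 60),
                   Science = c(65, 79, 85, 91),
                   bday    = c("1/1/2000", "2/2/2001", "3/3/2000", "4/4/1999"))

# Keep only the rows where Rank equals 1
freshmen_only <- subset(demo, subset = Rank == 1)
print(freshmen_only)

# Keep the run of adjacent columns from English through Science
grade_data <- subset(demo, select = English:Science)
print(names(grade_data))

# Drop a single column by prefixing its name with a minus sign
data_dropbday <- subset(demo, select = -bday)
print(names(data_dropbday))
```

Note that subset() drops rows where the condition evaluates to NA, so respondents with a missing Rank would not appear in freshmen_only.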

Working in R, part 2 | Summary statistics

Summary statistics for continuous numeric variables

mean(), sd(), min(), max(), median(), sum()
- The na.rm=TRUE argument
Base R graphics: hist(), boxplot()
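The na.rm=TRUE argument matters whenever a variable has missing values, as this sketch with made-up numbers shows. The vector sleep_hours is hypothetical.

```r
# Made-up values with one missing observation
sleep_hours <- c(8, 6.5, NA, 7, 9)

# By default, a single NA makes the result NA
mean_with_na <- mean(sleep_hours)
print(mean_with_na)

# na.rm = TRUE drops missing values before computing the statistic
mean_sleep <- mean(sleep_hours, na.rm = TRUE)
sd_sleep   <- sd(sleep_hours, na.rm = TRUE)
print(mean_sleep)
print(sd_sleep)
```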

Summary statistics for categorical variables

table()
addmargins()
prop.table()
Base R graphics: barplot(table(...))
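A brief sketch of how these three functions fit together, using a made-up vector of class ranks. The labels follow the Fr/So/Jr/Sr coding shown earlier.

```r
# Made-up class ranks for six respondents
rank <- c("Fr", "So", "Fr", "Jr", "Fr", "So")

# Frequency counts for each category
counts <- table(rank)
print(counts)

# Add a "Sum" entry holding the total count
counts_with_total <- addmargins(counts)
print(counts_with_total)

# Convert counts to proportions
props <- prop.table(counts)
print(props)
```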

Working in R, part 3 | Data analysis

Examples

Correlation using cor()
Linear regression using lm()
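As a rough sketch of the first research question (do students who study more get less sleep?), the block below runs cor() and lm() on a small made-up dataset. The numbers are invented for illustration and are not results from the workshop data.

```r
# Made-up data: hours of study and hours of sleep for six students
demo <- data.frame(StudyTime = c(1, 2, 3, 4, 5, 6),
                   SleepTime = c(9.5, 9.0, 8.7, 8.1, 7.9, 7.2))

# Pearson correlation between the two variables
r <- cor(demo$StudyTime, demo$SleepTime)
print(r)

# Linear regression of SleepTime on StudyTime
fit <- lm(SleepTime ~ StudyTime, data = demo)
print(summary(fit))

# Extract the estimated slope for StudyTime
slope <- coef(fit)[["StudyTime"]]
print(slope)
```

With real data that contains missing values, cor() returns NA unless its use argument is set (for example, use = "complete.obs"), whereas lm() drops incomplete rows by default.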

Questions

Appendix A | R version used for this workshop

R version used for this workshop

devtools::session_info()[[1]]

 setting  value
 version  R version 4.3.3 (2024-02-29 ucrt)
 os       Windows 10 x64 (build 19045)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/New_York
 date     2024-04-11
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

R package versions used for this workshop

devtools::package_info(c("haven"), dependencies=FALSE)

	package	ondiskversion	loadedversion	path	loadedpath	attached	is_base	date	source	md5ok	library
haven	haven	2.5.4	2.5.4	C:/Users/kyeager4/AppData/Local/R/win-library/4.3/haven	C:/Users/kyeager4/AppData/Local/R/win-library/4.3/haven	TRUE	FALSE	2023-11-30	CRAN (R 4.3.2)	TRUE	C:/Users/kyeager4/AppData/Local/R/win-library/4.3