R Tidyverse
The most popular R dialect
Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)
Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University
1 Learning objectives
- Know the components of the R tidyverse packages.
- Tell the difference between long tables and wide tables, and convert between them.
- Use the tidyverse functions for group summary.
2 Introduction
2.1 Definition
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
A dialect of R.
2.2 Installation
install.packages("tidyverse")
2.3 Members
library(tidyverse)
tidyverse_packages()
[1] "broom" "conflicted" "cli" "dbplyr"
[5] "dplyr" "dtplyr" "forcats" "ggplot2"
[9] "googledrive" "googlesheets4" "haven" "hms"
[13] "httr" "jsonlite" "lubridate" "magrittr"
[17] "modelr" "pillar" "purrr" "ragg"
[21] "readr" "readxl" "reprex" "rlang"
[25] "rstudioapi" "rvest" "stringr" "tibble"
[29] "tidyr" "xml2" "tidyverse"
Core members (attached by library(tidyverse)):
- ggplot2: Creating graphics.
- dplyr: Data manipulation.
- tidyr: Tidying data.
- readr: Reading rectangular data.
- purrr: Functional programming.
- tibble: Re-imagining of the data frame.
- stringr: Working with strings.
- forcats: Working with factors.
- lubridate: Working with dates and times (a core member since tidyverse 2.0.0).
2.4 Workflow
3 Tidy data
3.1 Rules
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
Which is a tidy data set?
table1
table2
table3
table4a
table4b
table5
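One way to approach the question is to inspect the column names and row counts of the example tables and compare them with the three rules. The sketch below assumes the tidyr package is attached, since it ships table1 and table2; the comments say what to look for rather than quoting printed output.

```r
library(tidyr)  # ships the example tables table1, table2, ...

# table1: each variable (country, year, cases, population) has its own column
names(table1)
nrow(table1)   # one row per country-year observation

# table2: the `type` column holds variable names, so each
# country-year observation is spread over two rows
names(table2)
unique(table2$type)
nrow(table2)
```

The same inspection can be repeated for table3, table4a, table4b, and table5.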
3.2 Why tidy data
# Compute rate per 10,000
table1 %>%
  mutate(rate = cases / population * 10000)
# Compute cases per year
table1 %>%
  count(year, wt = cases)
3.3 Conversions
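# Wide to long: the year columns become values of a new `year` column
tidy4a <- table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
# Join on the shared key columns
left_join(tidy4a, tidy4b, by = c("country", "year"))
# Long to wide: the values of `type` become column names
table2 %>%
  pivot_wider(names_from = type, values_from = count)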
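As a quick check that the two conversions are inverses of each other, the following sketch pivots table4a to the long format and back again. The object names long and wide_again are illustrative only and do not come from the original material.

```r
library(tidyr)

# Wide -> long: one row per country-year
long <- table4a |>
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

# Column names are text, so `year` is a character column here
is.character(long$year)

# Long -> wide: back to one column per year
wide_again <- long |>
  pivot_wider(names_from = year, values_from = cases)

# Compare with the original table, ignoring tibble-specific attributes
all.equal(as.data.frame(wide_again), as.data.frame(table4a))
```

If a numeric year is needed, it can be converted after pivoting, for example with as.integer().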
4 Pipe operator
Suppose \(y\) is a function of \(x\), and \(z\) is a function of \(y\):
\[y = f(x)\]
\[z = g(y)\]
How should we calculate \(z\) if we know \(x\)?
Method 1: step by step:
y <- f(x)
z <- g(y)
Method 2: in one step (nested calls):
z <- g(f(x))
What problems do they have?
Using pipe:
# %>% or |>
z <- x %>%
  f() %>%
  g()
An example:
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
# Method 1:
y1 <- log(x)
y2 <- diff(y1)
y3 <- exp(y2)
z <- round(y3)
# Method 2
z <- round(exp(diff(log(x))))
# Pipe
z <- x %>%
  log() %>%
  diff() %>%
  exp() %>%
  round()
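The three methods are meant to produce the same result and differ only in readability. A minimal sketch to confirm this, using the native pipe |> (available from R 4.1) so that no package is needed; the names z_nested, z_pipe, and ratios are illustrative:

```r
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Nested calls: read from the inside out
z_nested <- round(exp(diff(log(x))))

# Pipe: read from left to right, in the order the steps are applied
z_pipe <- x |> log() |> diff() |> exp() |> round()

identical(z_nested, z_pipe)

# exp(diff(log(x))) is the ratio of each element to the previous one
ratios <- x[-1] / x[-length(x)]
all.equal(exp(diff(log(x))), ratios)
```

The pipe version avoids both the throwaway intermediate variables of Method 1 and the inside-out reading order of Method 2.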
5 Cases
5.1 Case 1: Typical workflow
Draw a box plot for every variable of each species for the iris data.
# base R
par(mfrow = c(2, 2))
for (i in 1:4) {
  boxplot(iris[, i] ~ iris$Species, las = 1, xlab = 'Species', ylab = names(iris)[i])
}
# tidyverse
iris |>
  pivot_longer(-Species) |>
  ggplot() +
  geom_boxplot(aes(Species, value)) +
  facet_wrap(name ~ .)
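The tidyverse version works because pivot_longer(-Species) first stacks the four measurement columns into a long table. The sketch below shows only that intermediate step; iris_long is an illustrative name, and the column names name and value are the pivot_longer() defaults when names_to and values_to are not given.

```r
library(tidyr)

# Stack every column except Species into name/value pairs
iris_long <- iris |>
  pivot_longer(-Species)

names(iris_long)          # Species, name, value
nrow(iris_long)           # 150 flowers x 4 measurements
unique(iris_long$name)    # the four former column names
```

With the data in this shape, value maps to the y axis and name defines one facet per variable.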
5.2 Case 2: Group statistics
Calculate the mean, median, and standard deviation of each variable for each species of the iris dataset.
# base R
dtf1_mean <- data.frame(tapply(iris$Sepal.Length, iris$Species, mean, na.rm = TRUE))
dtf1_sd <- data.frame(tapply(iris$Sepal.Length, iris$Species, sd, na.rm = TRUE))
dtf1_median <- data.frame(tapply(iris$Sepal.Length, iris$Species, median, na.rm = TRUE))
# and other variables
# use a loop
dtf <- data.frame(rep(NA, 3))
for (i in 1:4) {
  dtf1_mean <- data.frame(tapply(iris[, i], iris$Species, mean, na.rm = TRUE))
  dtf1_sd <- data.frame(tapply(iris[, i], iris$Species, sd, na.rm = TRUE))
  dtf1_median <- data.frame(tapply(iris[, i], iris$Species, median, na.rm = TRUE))
  dtf1 <- cbind(dtf1_mean, dtf1_sd, dtf1_median)
  names(dtf1) <- paste0(names(iris)[i], '.', c('mean', 'sd', 'median'))
  dtf <- cbind(dtf, dtf1)
}
# tidyverse
dtf <- iris |>
  pivot_longer(-Species) |>
  group_by(Species, name) |>
  summarise(mean = mean(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE),
            median = median(value, na.rm = TRUE))
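The tidyverse result above is a long table with one row per species and variable, whereas the base R loop builds a wide table with one row per species. If the wide layout is preferred, one possible alternative (not part of the original example) is dplyr's across(), sketched below with the illustrative names wide_stats and base_mean. The last lines cross-check one column against tapply().

```r
library(dplyr)

# One row per species, one column per variable-statistic pair
wide_stats <- iris |>
  group_by(Species) |>
  summarise(across(everything(),
                   list(mean = mean, sd = sd, median = median)))

dim(wide_stats)      # 3 species; 1 key column + 4 variables x 3 statistics
names(wide_stats)    # default naming pattern is <column>_<function>

# Cross-check against the base R approach
base_mean <- tapply(iris$Sepal.Length, iris$Species, mean)
all.equal(wide_stats$Sepal.Length_mean, as.vector(base_mean))
```

Unlike the loop, this version needs no placeholder column and no manual cbind() of intermediate results.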