R Tidyverse
The most popular R dialect
Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)
Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University
1 Learning objectives
- Know the components of the R tidyverse packages.
- Tell the difference between long tables and wide tables, and convert between them.
- Use the tidyverse functions for group summary.
2 Introduction
2.1 Definition
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
A dialect of R.
2.2 Installation
install.packages("tidyverse")
2.3 Members
library(tidyverse)
tidyverse_packages()
 [1] "broom"         "conflicted"    "cli"           "dbplyr"       
 [5] "dplyr"         "dtplyr"        "forcats"       "ggplot2"      
 [9] "googledrive"   "googlesheets4" "haven"         "hms"          
[13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
[17] "modelr"        "pillar"        "purrr"         "ragg"         
[21] "readr"         "readxl"        "reprex"        "rlang"        
[25] "rstudioapi"    "rvest"         "stringr"       "tibble"       
[29] "tidyr"         "xml2"          "tidyverse"
Core members:
- ggplot2: Creating graphics.
- dplyr: Data manipulation.
- tidyr: Tidying data.
- readr: Reading rectangular data.
- purrr: Functional programming.
- tibble: Re-imagining of the data frame.
- stringr: Working with strings.
- forcats: Working with factors.
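As a rough illustration of how some of the core packages work together, the sketch below builds a small tibble and passes it through stringr, dplyr, and forcats functions. The `plants` data and its column names are invented for this example; they are not part of the course material.

```r
library(tidyverse)

# Hypothetical toy data, invented for this illustration
plants <- tibble(
  species   = c("setosa", "virginica", "setosa", "versicolor"),
  height_cm = c(10.2, 25.1, 11.8, 18.4)
)

# stringr tidies the labels, dplyr computes a grouped mean
summary_tbl <- plants |>
  mutate(species = str_to_title(species)) |>
  group_by(species) |>
  summarise(mean_height = mean(height_cm), .groups = "drop")

# forcats orders factor levels by how often each value occurs
species_levels <- levels(fct_infreq(plants$species))

summary_tbl
species_levels
```

Loading `library(tidyverse)` attaches all of these core packages at once, so no separate `library()` calls are needed.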
2.4 Workflow
3 Tidy data
3.1 Rules
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
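To make the three rules concrete, here is a minimal sketch with two versions of the same small data set. The site names and counts are invented for illustration only.

```r
library(tibble)

# Untidy: the values of the variable "year" are used as column names
untidy <- tibble(
  site   = c("A", "B"),
  `2020` = c(12, 7),
  `2021` = c(15, 9)
)

# Tidy: site, year, and count each have their own column,
# and each row is one observation (one site in one year)
tidy <- tibble(
  site  = c("A", "A", "B", "B"),
  year  = c(2020, 2021, 2020, 2021),
  count = c(12, 15, 7, 9)
)

untidy
tidy
```

Both tables hold the same four counts; only the tidy one stores the year as data rather than as column names.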
Which is a tidy data set?
table1
table2
table3
table4a
table4b
table5
3.2 Why tidy data
# Compute rate per 10,000
table1 %>% 
  mutate(rate = cases / population * 10000)
# Compute cases per year
table1 %>% 
  count(year, wt = cases)
3.3 Conversions
tidy4a <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
# Join on the shared key columns explicitly
left_join(tidy4a, tidy4b, by = c("country", "year"))
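The sketch below shows that `pivot_longer()` and its counterpart `pivot_wider()` undo each other. It uses a small hypothetical table (the countries "X" and "Y" and their numbers are invented) so that it runs without the built-in example tables.

```r
library(tibble)
library(tidyr)

# Hypothetical wide table: one column per year
wide <- tibble(
  country = c("X", "Y"),
  `1999`  = c(10, 20),
  `2000`  = c(30, 40)
)

# Wide -> long: the year column names become values of a "year" column
long <- wide |>
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

# Long -> wide: the values of "year" become column names again
back <- long |>
  pivot_wider(names_from = year, values_from = cases)

long
back
```

Note that `year` is stored as a character column in `long`, because it was created from column names.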
table2 %>%
    pivot_wider(names_from = type, values_from = count)
4 Pipe operator
Suppose \(y\) is a function of \(x\), and \(z\) is a function of \(y\):
\[y = f(x)\]
\[z = g(y)\]
How should we calculate \(z\) if we know \(x\)?
Method 1: step by step:
y <- f(x)
z <- g(y)
Method 2: in one step:
z <- g(f(x))
What problems do they have?
Using pipe:
# %>% or |> 
z <- x %>% 
  f() %>% 
  g()
An example:
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
# Method 1:
y1 <- log(x)
y2 <- diff(y1)
y3 <- exp(y2)
z <- round(y3)
# Method 2:
z <- round(exp(diff(log(x))))
# Pipe:
z <- x %>% 
  log() %>% 
  diff() %>% 
  exp() %>% 
  round()
5 Cases
5.1 Case 1: Typical workflow
Draw a box plot of each variable by species for the iris dataset.
# base R
par(mfrow = c(2, 2))
for (i in 1:4) {
  boxplot(iris[, i] ~ iris$Species, las = 1, xlab = 'Species', ylab = names(iris)[i])
}
# tidyverse
iris |> 
  pivot_longer(-Species) |> 
  ggplot() +
  geom_boxplot(aes(Species, value)) + 
  facet_wrap(~ name)
5.2 Case 2: Monthly statistics
Calculate the mean, median, and standard deviation of each variable for each species in the iris dataset.
# base R
dtf1_mean <- data.frame(tapply(iris$Sepal.Length, iris$Species, mean, na.rm = TRUE))
dtf1_sd <- data.frame(tapply(iris$Sepal.Length, iris$Species, sd, na.rm = TRUE))
dtf1_median <- data.frame(tapply(iris$Sepal.Length, iris$Species, median, na.rm = TRUE))
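# ... and likewise for the other variables
# Alternatively, use a loop
# Start from a Species column rather than an NA placeholder column
dtf <- data.frame(Species = levels(iris$Species))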
for (i in 1:4) {
  dtf1_mean <- data.frame(tapply(iris[, i], iris$Species, mean, na.rm = TRUE))
  dtf1_sd <- data.frame(tapply(iris[, i], iris$Species, sd, na.rm = TRUE))
  dtf1_median <- data.frame(tapply(iris[, i], iris$Species, median, na.rm = TRUE))
  dtf1 <- cbind(dtf1_mean, dtf1_sd, dtf1_median)
  names(dtf1) <- paste0(names(iris)[i], '.', c('mean', 'sd', 'median'))
  dtf <- cbind(dtf, dtf1)
}
# tidyverse
dtf <- iris |> 
  pivot_longer(-Species) |> 
  group_by(Species, name) |> 
  summarise(mean = mean(value, na.rm = TRUE),
            sd   = sd(value, na.rm = TRUE),
            median = median(value, na.rm = TRUE),
            .groups = "drop")  # return an ungrouped tibble
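If a wide layout like the one the base R loop builds is preferred, the long summary can be pivoted back. The following is a minimal sketch, not part of the original case: it recomputes the summary so that it runs on its own, and the object names `species_summary` and `wide_summary` and the column naming pattern (for example `Sepal.Length_mean`) are choices made for this illustration.

```r
library(dplyr)
library(tidyr)

# Long summary: one row per species and variable
species_summary <- iris |>
  pivot_longer(-Species) |>
  group_by(Species, name) |>
  summarise(mean   = mean(value, na.rm = TRUE),
            sd     = sd(value, na.rm = TRUE),
            median = median(value, na.rm = TRUE),
            .groups = "drop")

# Wide summary: one row per species, one column per variable and statistic
wide_summary <- species_summary |>
  pivot_wider(names_from  = name,
              values_from = c(mean, sd, median),
              names_glue  = "{name}_{.value}")

wide_summary
```

The long form is usually easier to plot with ggplot2, while the wide form is closer to what one would put in a report table.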