R Tidyverse

The most popular R dialect

Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)

Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University

1 Learning objectives

  • Know the components of the R tidyverse packages.
  • Tell the difference between long tables and wide tables, and convert between them.
  • Use tidyverse functions to compute grouped summaries.

2 Introduction

2.1 Definition

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Because its grammar differs noticeably from base R, the tidyverse can be regarded as a dialect of R.

2.2 Installation

install.packages("tidyverse")

2.3 Members

library(tidyverse)
tidyverse_packages()
 [1] "broom"         "conflicted"    "cli"           "dbplyr"       
 [5] "dplyr"         "dtplyr"        "forcats"       "ggplot2"      
 [9] "googledrive"   "googlesheets4" "haven"         "hms"          
[13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
[17] "modelr"        "pillar"        "purrr"         "ragg"         
[21] "readr"         "readxl"        "reprex"        "rlang"        
[25] "rstudioapi"    "rvest"         "stringr"       "tibble"       
[29] "tidyr"         "xml2"          "tidyverse"    

Core members:

  • ggplot2: Creating graphics.
  • dplyr: Data manipulation.
  • tidyr: Tidying data.
  • readr: Read rectangular data.
  • purrr: Functional programming.
  • tibble: Re-imagining of the data frame.
  • stringr: Working with strings.
  • forcats: Working with factors.

2.4 Workflow

3 Tidy data

3.1 Rules

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Which is a tidy data set?

table1
table2
table3
table4a
table4b
table5
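One way to check the three rules against the candidates above is to compare column names and row counts. The sketch below assumes the example tables shipped with the tidyr package and looks only at table1 and table2; the other tables can be inspected the same way.

```r
library(tidyr)  # provides the example tables table1, table2, ...

# table1: one column per variable (country, year, cases, population)
# and one row per country-year observation
names(table1)
nrow(table1)

# table2: the variables cases and population are stacked in the
# type/count column pair, so each observation is spread over two rows
names(table2)
nrow(table2)
```

table2 has twice as many rows as table1 for the same observations, which suggests that it breaks the rule "each observation must have its own row".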

3.2 Why tidy data

# Compute rate per 10,000
table1 %>% 
  mutate(rate = cases / population * 10000)

# Compute cases per year
table1 %>% 
  count(year, wt = cases)
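The weighted count above may not be obvious at first sight. As a minimal sketch (the object names by_count and by_summarise are made up for illustration, and the native pipe requires R 4.1 or later), the same result can be written as an explicit grouped summary:

```r
library(dplyr)
library(tidyr)  # provides the example table table1

# Weighted count: sums `cases` within each year
by_count <- table1 |>
  count(year, wt = cases)

# The same computation as an explicit grouped summary
by_summarise <- table1 |>
  group_by(year) |>
  summarise(n = sum(cases))

# Check that both approaches give the same totals
all.equal(by_count$n, by_summarise$n)
```

Both versions rely on cases being a column of its own, which is what the tidy layout guarantees.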

3.3 Conversions

tidy4a <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

# Name the join keys explicitly instead of relying on the default guess
left_join(tidy4a, tidy4b, by = c("country", "year"))

table2 %>%
    pivot_wider(names_from = type, values_from = count)
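The two pivot functions are inverses of each other, which a round trip makes visible. The sketch below assumes the tidyr example table table4a; the names long4a and wide4a are made up for illustration, and names_transform is an optional extra that converts the former column names to integers.

```r
library(dplyr)
library(tidyr)

# Wide -> long: the year columns become values of a single `year` variable.
# names_transform converts the former column names from character to integer.
long4a <- table4a |>
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases",
               names_transform = list(year = as.integer))

# Long -> wide: spreading `year` again recovers the layout of table4a
wide4a <- long4a |>
  pivot_wider(names_from = year, values_from = cases)

dim(long4a)    # 6 rows, 3 columns
dim(wide4a)    # 3 rows, 3 columns
names(wide4a)  # same column names as table4a
```

Without names_transform, year stays a character column, which may matter when it is later used in a join or on a plot axis.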

4 Pipe operator

Suppose \(y\) is a function of \(x\), and \(z\) is a function of \(y\):

\[y = f(x)\]

\[z = g(y)\]

How should we calculate \(z\) if we know \(x\)?

Method 1: step by step, with an intermediate variable:

y <- f(x)
z <- g(y)

Method 2: nested calls in one step:

z <- g(f(x))

What problems do they have?

Using the pipe:

# %>% (magrittr) or |> (base R, version 4.1 or later)
z <- x %>% 
  f() %>% 
  g()

An example:

x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
# Method 1:
y1 <- log(x)
y2 <- diff(y1)
y3 <- exp(y2)
z <- round(y3)
# Method 2:
z <- round(exp(diff(log(x))))
# Pipe:
z <- x %>% 
  log() %>% 
  diff() %>% 
  exp() %>% 
  round()
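To confirm that the three styles are interchangeable, the results can be compared directly. This is a small sketch; the names z_nested, z_magrittr, and z_native are made up for illustration, and the native pipe requires R 4.1 or later.

```r
library(magrittr)  # provides %>%

x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Nested calls, read from the inside out
z_nested <- round(exp(diff(log(x))))

# magrittr pipe, read from left to right
z_magrittr <- x %>% log() %>% diff() %>% exp() %>% round()

# Native pipe, same reading order
z_native <- x |> log() |> diff() |> exp() |> round()

identical(z_nested, z_magrittr)
identical(z_nested, z_native)

# exp(diff(log(x))) is the ratio of consecutive elements of x
z_native
# 3 2 2 1 0 0 49 1
```

The pipe versions list the functions in the order in which they are applied, while the nested version lists them in reverse.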

5 Cases

5.1 Case 1: Typical workflow

Draw box plots of every variable, grouped by species, for the iris data.

# base R
par(mfrow = c(2, 2))
for (i in 1:4) {
  boxplot(iris[, i] ~ iris$Species, las = 1, xlab = 'Species', ylab = names(iris)[i])
}

# tidyverse
iris |> 
  pivot_longer(-Species) |> 
  ggplot() +
  geom_boxplot(aes(Species, value)) + 
  facet_wrap(~ name)
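The base R panels each get their own y axis, while facet_wrap() shares one y axis across panels by default. As a possible variation (the column name variable and the objects iris_long and p are chosen here for illustration), the scales can be freed so that the two versions are easier to compare:

```r
library(tidyr)
library(ggplot2)

# Long format: one row per flower and measured variable
iris_long <- iris |>
  pivot_longer(-Species, names_to = "variable", values_to = "value")

# One panel per variable, each with its own y axis
p <- ggplot(iris_long, aes(Species, value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(x = "Species", y = NULL)

p
```

Whether shared or free scales are preferable depends on whether the variables should be compared with each other or only across species.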

5.2 Case 2: Group statistics

Calculate the mean, median, and standard deviation of each variable for each species in the iris dataset.

# base R
dtf1_mean <- data.frame(tapply(iris$Sepal.Length, iris$Species, mean, na.rm = TRUE))
dtf1_sd <- data.frame(tapply(iris$Sepal.Length, iris$Species, sd, na.rm = TRUE))
dtf1_median <- data.frame(tapply(iris$Sepal.Length, iris$Species, median, na.rm = TRUE))
# and other variables

# use a loop
dtf <- data.frame(Species = levels(iris$Species))  # one row per species, same order as tapply()
for (i in 1:4) {
  dtf1_mean <- data.frame(tapply(iris[, i], iris$Species, mean, na.rm = TRUE))
  dtf1_sd <- data.frame(tapply(iris[, i], iris$Species, sd, na.rm = TRUE))
  dtf1_median <- data.frame(tapply(iris[, i], iris$Species, median, na.rm = TRUE))
  dtf1 <- cbind(dtf1_mean, dtf1_sd, dtf1_median)
  names(dtf1) <- paste0(names(iris)[i], '.', c('mean', 'sd', 'median'))
  dtf <- cbind(dtf, dtf1)
}

# tidyverse
dtf <- iris |> 
  pivot_longer(-Species) |> 
  group_by(Species, name) |> 
  summarise(mean = mean(value, na.rm = TRUE),
            sd   = sd(value, na.rm = TRUE),
            median = median(value, na.rm = TRUE),
            .groups = "drop")  # return an ungrouped tibble
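The tidyverse result is a long table with one row per species and variable, whereas the base R loop builds one row per species. If the wide layout is wanted, the summary can be reshaped afterwards. The sketch below uses the made-up names stats_long and stats_wide and builds column names such as Sepal.Length.mean with names_glue.

```r
library(dplyr)
library(tidyr)

# Long summary: 3 species x 4 variables = 12 rows
stats_long <- iris |>
  pivot_longer(-Species) |>
  group_by(Species, name) |>
  summarise(mean   = mean(value, na.rm = TRUE),
            sd     = sd(value, na.rm = TRUE),
            median = median(value, na.rm = TRUE),
            .groups = "drop")

# Wide summary: one row per species, one column per variable-statistic pair
stats_wide <- stats_long |>
  pivot_wider(names_from = name,
              values_from = c(mean, sd, median),
              names_glue = "{name}.{.value}")

dim(stats_long)  # 12 rows, 5 columns
dim(stats_wide)  # 3 rows, 13 columns
```

The long form is usually more convenient for further grouping or plotting, while the wide form is closer to a table for a report.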

6 Further readings