R for characters

Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)

Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University

1 Learning objectives

  • R functions to split, concatenate, find, and replace characters.
  • Basic usage of regular expressions.

2 R functions for characters

Function                             Usage
readLines(), writeLines()            Read and write text files
tolower(), toupper()                 Change the case of characters
nchar()                              Count the number of characters
strsplit(), substr(), substring()    Split and extract characters
paste(), cat()                       Concatenate characters
grep(), gsub(), sub(), chartr()      Find and replace characters
table(), unique(), duplicated()      Count and find duplicated characters
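
Several functions in the table are not used in the examples that follow. The sketch below is one possible illustration of them. The vector `s` and the temporary file are hypothetical and not part of the course data.

```r
# Hypothetical example strings, not part of the course data
s <- c('Apple pie', 'banana split')

upper  <- toupper(s)               # upper-case copy of each string
head5  <- substring(s, 1, 5)       # characters 1 to 5 of each string
mapped <- chartr('ab', 'AB', s)    # map a -> A and b -> B, character by character

# Round trip through a temporary text file, one string per line
tmp <- tempfile(fileext = '.txt')
writeLines(s, tmp)
s_back <- readLines(tmp)
identical(s, s_back)

# cat() prints to the console without quotes or index prefixes
cat(s, sep = '\n')
```

Unlike gsub(), chartr() replaces single characters one for one, so the old and new arguments must have the same length.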

3 Basic

x <- 'The quick brown fox jumps over the lazy dog'
dtf <- read.csv('data/student_names.csv')

3.1 Brief view

class(x)
length(x)
nchar(x)

class(dtf$Name)
length(dtf$Name)
nchar(dtf$Name)
Whose name is the longest?
name_n <- nchar(dtf$Name)
name_nmax <- which.max(name_n)
dtf$Name[name_nmax]

# or
dtf$Name[which.max(nchar(dtf$Name))]

# or
library(magrittr)
dtf$Name %>% nchar() %>% which.max() %>% dtf$Name[.]

3.2 Case

# tolower() toupper()
xlower <- tolower(x)
dtf$pro <- tolower(dtf$Prgrm)

3.3 Split

# strsplit()

xsingle <- strsplit(xlower, '')
class(xsingle)
xsingle1 <- xsingle[[1]]
class(xsingle1)
nchar(xsingle1)
length(xsingle1)

table(xsingle1)
duplicated(xsingle1)
xsingle1[!duplicated(xsingle1)]
unique(xsingle1)
length(unique(xsingle1))


x_word <- strsplit(xlower, split = ' ')

name_split <- strsplit(dtf$Name, ' ')
class(name_split)
length(name_split)
nchar(name_split)  # a list is coerced to one string per element first, so these are not name lengths

lapply(name_split, length)
sapply(name_split, length)
sapply(name_split, nchar)
Who has the most names?
dtf$Name[which.max(sapply(name_split, length))]
# separate()
library(tidyr)
dtf2 <- separate(dtf, Name, c("GivenName", "LastName"), sep = ' ')
dtf$FamilyName <- dtf2$LastName
# substr()
substr(x, 13, 15)
dtf$NameAbb <- substr(dtf$Name, 1, 1)

3.4 Concatenate

# paste()
paste(x, '.', sep = '')
paste0(x, '.')

paste(dtf$NameAbb, '.', sep = '')
paste(dtf$NameAbb, collapse = ' ')

paste(dtf$NameAbb, dtf$FamilyName, sep = '. ')
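
The difference between `sep` and `collapse` is easy to miss in the calls above. A minimal sketch, assuming two small hypothetical vectors in place of the columns of `dtf`:

```r
# Hypothetical initials and family names, standing in for dtf$NameAbb and dtf$FamilyName
initials <- c('J', 'L', 'W')
family   <- c('Guo', 'Chen', 'Li')

# sep joins the arguments element by element: the result has three strings
by_person <- paste(initials, family, sep = '. ')
length(by_person)

# collapse joins the elements of one vector: the result is a single string
one_line <- paste(by_person, collapse = '; ')
length(one_line)
```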

3.5 Find

# grep()
grep('r', x)
grep('r', xsingle1)
grep('-', x)
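# For each character, print the positions where it occurs in xsingle1
for (i in seq_along(xsingle1)) {
  print(grep(xsingle1[i], xsingle1))
}
# The same, with the character itself pasted in front of each position
for (i in seq_along(xsingle1)) {
  print(paste(xsingle1[i], grep(xsingle1[i], xsingle1)))
}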
sapply(xsingle1, function(x) grep(x, xsingle1))


table(dtf2$GivenName)
grep('Jiayi', dtf$Name, value = TRUE)
grep('Jiayi|Guo', dtf$Name, value = TRUE)

# regexpr()
regexpr('r', x)
gregexpr('r', x)

regexpr(' ', dtf$Name)
gregexpr(' ', dtf$Name)
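
The search functions above return different kinds of results. The following sketch compares them side by side on a small hypothetical vector. `grepl()` is not used elsewhere in these notes but belongs to the same family in base R.

```r
fruit <- c('pear', 'kiwi', 'cherry')    # hypothetical vector

idx  <- grep('r', fruit)                 # indices of the matching elements
hits <- grep('r', fruit, value = TRUE)   # the matching elements themselves
flag <- grepl('r', fruit)                # one TRUE/FALSE per element

# regexpr(): position of the first match in each element, -1 if there is none
first_r <- as.integer(regexpr('r', fruit))

# gregexpr(): a list with the positions of every match in each element
all_r <- gregexpr('r', fruit)
n_r <- sapply(all_r, function(p) sum(p > 0))   # number of matches per element
```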

3.6 Replace

# gsub()
gsub(' ', '-', x)
gsub('E', 'e', dtf$Prgrm)
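
The function table lists both sub() and gsub(). A short sketch of how they differ, and of the `fixed = TRUE` argument, which treats the pattern as plain text rather than as a regular expression:

```r
x <- 'The quick brown fox jumps over the lazy dog'

first_only <- sub(' ', '-', x)     # replaces only the first space
every_one  <- gsub(' ', '-', x)    # replaces every space

# In a regular expression '.' matches any character ...
as_regex <- gsub('.', '!', 'a.b.c')
# ... while fixed = TRUE matches a literal dot only
as_text  <- gsub('.', '!', 'a.b.c', fixed = TRUE)
```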

4 Advanced

4.1 The stringr package

4.2 Regular expression

A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that define a search pattern. Usually such patterns are used by string searching algorithms for “find” or “find and replace” operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

— Wikipedia

help(regex)
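
Before the examples on the student data, here is a minimal sketch of a few common building blocks: anchors, character classes, quantifiers, and the wildcard. The vector `words` is hypothetical.

```r
words <- c('cat', 'cart', 'scat', 'Cat9', 'c t')   # hypothetical vector

starts_c  <- grep('^c', words, value = TRUE)                # ^ anchors the start
ends_t    <- grep('t$', words, value = TRUE)                # $ anchors the end
has_digit <- grep('[[:digit:]]', words, value = TRUE)       # character class
three     <- grep('^[[:alpha:]]{3}$', words, value = TRUE)  # exactly three letters
wildcard  <- grep('^c.t$', words, value = TRUE)             # . matches any one character
```

Matching is case sensitive by default, which is why 'Cat9' is not found by '^c'.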

Find:

# Find the one who has a given name with 4 letters and a family name with 4 letters
grep('^[[:alpha:]]{4} [[:alpha:]]{4}$', dtf$Name, value = TRUE)

Replace:

dtf$FirstName <- gsub('^([^ ]+).+[^ ]+$', '\\1', dtf$Name)
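
In the replacement above, the parentheses capture part of the match and '\\1' refers back to it. The sketch below applies the same idea to two hypothetical names. The second and third patterns are illustrative variations, not taken from the course material.

```r
full <- c('Jiayi Guo', 'Anna Maria Schmidt')   # hypothetical names

# Pattern from the notes: keep the first word
first_word <- gsub('^([^ ]+).+[^ ]+$', '\\1', full)

# Variation: keep the last word (the greedy .* runs up to the last space)
last_word <- sub('^.* ([^ ]+)$', '\\1', full)

# Variation: two groups, swapped in the replacement
swapped <- sub('^([^ ]+) (.+)$', '\\2, \\1', full)
```

The pattern from the notes needs at least three characters to match, so it is best suited to names with at least two words.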

5 Further readings

6 Exercises

  1. Here is a sentence in German:

    x <- 'Victor jagt zwölf Boxkämpfer quer über den großen Sylter Deich'
    1. How many characters are there in the sentence, excluding the blank spaces?

    2. Do these characters cover all the German letters?

    3. Which character is repeated the most times?

    4. Which is the longest word in the sentence?

    5. Create a new sentence in which the number word is replaced with Arabic numerals (zwölf -> 12).

  2. Process the student_names dataset (data/student_names.csv).

    1. Whose name is the shortest?

    2. Add a new column for the family name. How many different family names are there? Which family name occurs most often?

    3. Add a new column with the family name followed by the given name.

    4. Order the data frame by the family name.

  3. Text can be encrypted with a simple letter-shift rule, for example:

    Shift +1: C -> D, A -> B, K -> L, E -> F, so CAKE -> DBLF
    Shift +4: C -> G, A -> E, K -> O, E -> I, so CAKE -> GEOI

    Decrypt the following sentence, which was encrypted with a shift rule of this kind:

skt grcgey xkskshkx rubk hkigayk ul xusgtik utre
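
The encryption direction of the shift rule in the last exercise can be sketched with chartr(). The helper `shift_encrypt()` is a hypothetical name introduced only for this illustration. It handles upper-case letters only, and decryption is left as the exercise.

```r
# Hypothetical helper: shift upper-case letters by k positions, wrapping around after Z
shift_encrypt <- function(text, k) {
  old <- paste(LETTERS, collapse = '')
  # Index arithmetic: letter i moves to position (i + k), wrapped into 1..26
  new <- paste(LETTERS[(seq_along(LETTERS) + k - 1) %% 26 + 1], collapse = '')
  chartr(old, new, text)
}

plus1 <- shift_encrypt('CAKE', 1)   # the +1 example from the exercise
plus4 <- shift_encrypt('CAKE', 4)   # the +4 example from the exercise
```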