x <- 'The quick brown fox jumps over the lazy dog'
dtf <- read.csv('data/student_names.csv')
R for characters
Dr. Peng Zhao (✉ peng.zhao@xjtlu.edu.cn)
Department of Health and Environmental Sciences
Xi’an Jiaotong-Liverpool University
1 Learning objectives
- R functions to split, concatenate, find, and replace characters.
- Basic usage of regular expressions.
2 R functions for characters
Function | Usage |
---|---|
readLines(), writeLines() | Read and write text files |
tolower(), toupper() | Change the case of the characters |
nchar() | Count the number of characters |
strsplit(), substr(), substring() | Split and extract characters |
paste(), cat() | Connect characters |
grep(), gsub(), sub(), chartr() | Find and replace characters |
table(), unique(), duplicated() | Count and find duplicated characters |
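Several functions in the table are not demonstrated in the sections that follow. The sketch below is a minimal illustration of a few of them, assuming a temporary scratch file (`tmp`) and made-up strings that are not part of the course data.

```r
# Write two lines to a temporary file, then read them back.
tmp <- tempfile(fileext = ".txt")   # scratch file, an assumption for this sketch
writeLines(c("first line", "second line"), tmp)
txt <- readLines(tmp)
nchar(txt)                          # 10 11

# chartr() translates characters one-to-one: a -> A, e -> E
translated <- chartr("ae", "AE", "read a file")   # "rEAd A filE"

# sub() replaces only the first match; gsub() replaces every match
first_only <- sub("i", "I", txt[1])               # "fIrst line"
all_hits   <- gsub("i", "I", txt[1])              # "fIrst lIne"

# substring() is vectorised over start and stop positions
pieces <- substring("abcdef", 1:3, 3:5)           # "abc" "bcd" "cde"

unlink(tmp)                         # remove the scratch file
```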
3 Basic
3.1 Brief view
class(x)
length(x)
nchar(x)
class(dtf$Name)
length(dtf$Name)
nchar(dtf$Name)
Whose name is the longest?
name_n <- nchar(dtf$Name)
name_nmax <- which.max(name_n)
dtf$Name[name_nmax]
# or
dtf$Name[which.max(nchar(dtf$Name))]
# or
library(magrittr)
dtf$Name %>% nchar() %>% which.max() %>% dtf$Name[.]
3.2 Case
# tolower() toupper()
xlower <- tolower(x)
dtf$pro <- tolower(dtf$Prgrm)
3.3 Split
# strsplit()
xsingle <- strsplit(xlower, '')
class(xsingle)
xsingle1 <- xsingle[[1]]
class(xsingle1)
nchar(xsingle1)
length(xsingle1)
table(xsingle1)
duplicated(xsingle1)
xsingle1[!duplicated(xsingle1)]
unique(xsingle1)
length(unique(xsingle1))
x_word <- strsplit(xlower, split = ' ')
name_split <- strsplit(dtf$Name, ' ')
class(name_split)
length(name_split)
nchar(name_split)
lapply(name_split, length)
sapply(name_split, length)
sapply(name_split, nchar)
Who has the most names?
dtf$Name[which.max(sapply(name_split, length))]
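Because the course data file is not reproduced here, the same splitting steps can be tried on a small stand-in vector. The names in `nms` are hypothetical and serve only to make the sketch self-contained.

```r
# Hypothetical names, standing in for dtf$Name
nms <- c("Li Wei", "Anna Maria Schmidt", "Zhang Min")

parts <- strsplit(nms, " ")         # a list: one character vector per name
n_parts <- sapply(parts, length)    # number of name parts per person: 2 3 2
most_parts <- nms[which.max(n_parts)]                    # "Anna Maria Schmidt"

# Last part of each name, taken from the split list
last_part <- sapply(parts, function(p) p[length(p)])     # "Wei" "Schmidt" "Min"
```

`strsplit()` always returns a list, even for a single string, which is why `sapply()` is used to summarise each element.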
# separate()
library(tidyr)
dtf2 <- separate(dtf, Name, c("GivenName", "LastName"), sep = ' ')
dtf$FamilyName <- dtf2$LastName
# substr()
substr(x, 13, 15)
dtf$NameAbb <- substr(dtf$Name, 1, 1)
3.4 Concatenate
# paste()
paste(x, '.', sep = '')
paste0(x, '.')
paste(dtf$NameAbb, '.', sep = '')
paste(dtf$NameAbb, collapse = ' ')
paste(dtf$NameAbb, dtf$FamilyName, sep = '. ')
3.5 Find
# grep()
grep('r', x)
grep('r', xsingle1)
grep('-', x)
for (i in seq_along(xsingle1)) {
  print(grep(xsingle1[i], xsingle1))
}
for (i in seq_along(xsingle1)) {
  print(paste(xsingle1[i], grep(xsingle1[i], xsingle1)))
}
sapply(xsingle1, function(ch) grep(ch, xsingle1))
table(dtf2$GivenName)
grep('Jiayi', dtf$Name, value = TRUE)
grep('Jiayi|Guo', dtf$Name, value = TRUE)
# regexpr()
regexpr('r', x)
gregexpr('r', x)
regexpr(' ', dtf$Name)
gregexpr(' ', dtf$Name)
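The return values of `regexpr()` and `gregexpr()` are easier to read on a short string. The sketch below uses a made-up string `s`. It also uses `regmatches()`, a base R function not covered above, to turn match positions back into text.

```r
s <- "The quick brown fox"

m <- regexpr("o", s)                  # first match only
first_pos <- as.integer(m)            # 13
first_len <- attr(m, "match.length")  # 1

g <- gregexpr("o", s)                 # all matches, returned as a list
all_pos <- as.integer(g[[1]])         # 13 18

no_match <- as.integer(regexpr("z", s))   # -1 signals "not found"

# Extract the matched text itself: every run of letters, i.e. every word
words <- regmatches(s, gregexpr("[[:alpha:]]+", s))[[1]]
```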
3.6 Replace
# gsub()
gsub(' ', '-', x)
gsub('E', 'e', dtf$Prgrm)
4 Advanced
4.1 The stringr package
4.2 Regular expression
A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that define a search pattern. Usually such patterns are used by string searching algorithms for “find” or “find and replace” operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.
— Wikipedia
help(regex)
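A few common building blocks of a search pattern can be tried with `grepl()`, which returns TRUE or FALSE for each element. This is a minimal sketch on a made-up vector `words`. The help page above remains the reference for the full syntax.

```r
words <- c("fox", "box", "foxes", "Fox", "f0x")

starts_f  <- grepl("^f", words)               # ^ anchors the start
ends_x    <- grepl("x$", words)               # $ anchors the end
fox_box   <- grepl("^[fb]ox$", words)         # [fb] matches one of f or b
letters_o <- grepl("^[[:alpha:]]+$", words)   # + means one or more letters
three     <- grepl("^[[:alpha:]]{3}$", words) # {3} means exactly three
any_case  <- grepl("fox", words, ignore.case = TRUE)
```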
Find:
# Find the students whose given name and family name each have exactly 4 letters
grep('^[[:alpha:]]{4} [[:alpha:]]{4}$', dtf$Name, value = TRUE)
Replace:
dtf$FirstName <- gsub('^([^ ]+).+[^ ]+$', '\\1', dtf$Name)
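In the replacement above, the parentheses capture the leading run of non-space characters and `\\1` puts that captured text back in place of the whole match. The sketch below applies the same pattern to hypothetical names (`nms`). It then shows a simpler alternative, offered as an assumption about what is wanted, that deletes everything from the first space onward.

```r
nms <- c("Li Wei", "Anna Maria Schmidt")   # hypothetical names

# Same pattern as above: keep only the captured first part
first_a <- gsub("^([^ ]+).+[^ ]+$", "\\1", nms)   # "Li" "Anna"

# Alternative sketch: drop the first space and everything after it
first_b <- sub(" .*$", "", nms)                   # "Li" "Anna"

# Capture the last part instead of the first
last_a <- sub("^.* ([^ ]+)$", "\\1", nms)         # "Wei" "Schmidt"
```

The original pattern does not require a space, so a one-word name of three or more characters would still match and be shortened. The alternative leaves such a name unchanged.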
5 Further readings
- R for Data Science, Chapter 14.
- The R Book, Chapter 2.12.
6 Exercises
Here is a sentence in German:
x <- 'Victor jagt zwölf Boxkämpfer quer über den großen Sylter Deich'
How many characters are there in the sentence, excluding the blank spaces?
Do these characters cover all the German letters?
Which character is repeated the most times?
Which is the longest word in the sentence?
Give a new sentence with the number word replaced by Arabic numerals (zwölf -> 12).
Process the student names dataset (data/student_names.csv).
Whose name is the shortest?
Add a new column as the family name. How many family names do we have? Which family name is repeated the most times?
Add a new column with the family name followed by the given name.
Order the data frame by the family name.
- Text can be encrypted with a simple shift rule like this:
  C -> D, A -> B, K -> L, E -> F: CAKE -> DBLF (+1)
  C -> G, A -> E, K -> O, E -> I: CAKE -> GEOI (+4)
  Decrypt the following sentence, which was encrypted with a rule of this kind:
skt grcgey xkskshkx rubk hkigayk ul xusgtik utre
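The encryption rule in this exercise can be sketched with `chartr()`, which maps each letter to its shifted counterpart. `caesar_encrypt` is a hypothetical helper name, not a base R function. The sketch handles uppercase letters only and leaves decryption, including finding the shift, to the reader.

```r
# Shift every uppercase letter forward by `shift` positions, wrapping after Z.
# caesar_encrypt is a hypothetical helper written for this illustration.
caesar_encrypt <- function(text, shift) {
  shifted <- c(LETTERS, LETTERS)[seq_len(26) + shift %% 26]
  chartr(paste(LETTERS, collapse = ""), paste(shifted, collapse = ""), text)
}

caesar_encrypt("CAKE", 1)   # "DBLF"
caesar_encrypt("CAKE", 4)   # "GEOI"
```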