Mastering Regular Expressions: Dealing with String Data in R, Part I

R
tidyverse
regex
string
text
Solving R for Data Science (2ed) Exercises
Author

Carlos Fernández

Published

June 19, 2024

Introduction

Regular expressions (regex) are a powerful tool for working with string data in R. They might seem complex at first, but with some practice, they can become an invaluable part of your data science toolkit. In this blog post, we will tackle the first three exercises from the regular expressions chapter of “R for Data Science” (2nd edition). Let’s dive into the world of regex and see how we can manipulate and search text data effectively.

Setting Up

First, let’s load the necessary library:

library(tidyverse)

Exercise 1: Matching Literal Strings

Question: How would you match the literal string "'\? How about "$^$"?

Solution:

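To match these literal strings, we need to handle special characters carefully. Regex metacharacters such as $ and ^ must be escaped with a backslash (\), and because the backslash is itself an escape character in R strings, each one has to be doubled in a regular string literal. Here’s how we can do it: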

string_1 <- r"("'\)"
string_2 <- r"("$^$")"

# Visualize the strings
str_view(c(string_1, string_2))
[1] │ "'\
[2] │ "$^$"
# Using escape backslashes
str_view(string_1, "\"\'\\\\") # Matches "'\
[1] │ <"'\>
str_view(string_2, "\"\\$\\^\\$\"") # Matches "$^$"
[1] │ <"$^$">
# Using character classes
str_view(string_2, "\"[$]\\^[$]\"") # Matches "$^$" using character classes
[1] │ <"$^$">
# Using raw strings (simplifies escaping)
str_view(string_1, r"("'\\)") 
[1] │ <"'\>
str_view(string_2, r"("\$\^\$")")
[1] │ <"$^$">
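One way to see the two layers of escaping is to print the pattern with writeLines(), which shows the characters the regex engine actually receives. The sketch below also uses stringr's fixed() as an alternative that treats the pattern as literal text, so no regex escaping is needed. The variable names are only illustrative.

```r
library(stringr)

target <- r"("$^$")"             # the literal string "$^$", quotes included

# Regular string literal: every regex backslash is doubled for R's parser
pattern <- "\"\\$\\^\\$\""
writeLines(pattern)              # shows what the regex engine receives

# The raw-string form holds exactly the same characters
same_pattern <- identical(pattern, r"("\$\^\$")")

# Both the escaped regex and a fixed (literal) pattern find the target
regex_hit <- str_detect(target, pattern)
fixed_hit <- str_detect(target, fixed(target))
```

fixed() is handy when the text to find contains many metacharacters and no regex features are needed.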

Exercise 2: Why Patterns Don’t Match a Backslash

Question: Explain why each of these patterns don’t match a \:

  • "\"

  • "\\"

  • "\\\"

Solution:

Let’s break down why these patterns fail to match a single backslash:

string <- r"(\)"
str_view(string)
[1] │ \
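# str_view(string, "\")
# The backslash escapes the closing quote, so the string is never terminated and the code is left incomplete

# str_view(string, "\\")
# The R string "\\" is a single backslash. The regex engine reads it as the start of an escape with nothing after it and throws the error "Unrecognized backslash escape sequence"

# str_view(string, "\\\")
# The first two backslashes form one literal backslash and the third escapes the closing quote, so the string is again left unterminated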

# Correct way:
str_view(string, "\\\\") # Works: the R string holds two backslashes, which the regex engine reads as one literal backslash.
[1] │ <\>
# Using raw strings (simplifies escaping):
str_view(string, r"(\\)") # A raw string needs only the regex-level escape.
[1] │ <\>
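The two escaping layers can be checked by counting characters. This is a minimal sketch: nchar() reports how many characters survive R's string parsing, and tryCatch() is used here only to capture the regex error instead of stopping the script. The names lone, escaped and raw are illustrative.

```r
library(stringr)

string <- r"(\)"                 # one backslash

nchar("\\")                      # 1: the R string is a single backslash
nchar("\\\\")                    # 2: two backslashes reach the regex engine

# A lone backslash is an incomplete regex escape, so matching fails
lone <- tryCatch(
  str_detect(string, "\\"),
  error = function(e) "error"
)

# Two backslashes at the regex level match one literal backslash
escaped <- str_detect(string, "\\\\")
raw     <- str_detect(string, r"(\\)")
```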

Exercise 3: Searching Within a Corpus

Question: Given the corpus of common words in stringr::words, create regular expressions that find all words that:

a. Start with “y”.

b. Don’t start with “y”.

c. End with “x”.

d. Are exactly three letters long. (Don’t cheat by using str_length()!)

e. Have seven letters or more.

f. Contain a vowel-consonant pair.

g. Contain at least two vowel-consonant pairs in a row.

h. Only consist of repeated vowel-consonant pairs.

Solution:

Here are the regex patterns to match each condition:

# Visualize all words
str_view(words)
 [1] │ a
 [2] │ able
 [3] │ about
 [4] │ absolute
 [5] │ accept
 [6] │ account
 [7] │ achieve
 [8] │ across
 [9] │ act
[10] │ active
[11] │ actual
[12] │ add
[13] │ address
[14] │ admit
[15] │ advertise
[16] │ affect
[17] │ afford
[18] │ after
[19] │ afternoon
[20] │ again
... and 960 more
# a. Start with "y".
str_view(words, "^y")
[975] │ <y>ear
[976] │ <y>es
[977] │ <y>esterday
[978] │ <y>et
[979] │ <y>ou
[980] │ <y>oung
# b. Don't start with "y".
str_view(words[!str_detect(words, "^y")])
 [1] │ a
 [2] │ able
 [3] │ about
 [4] │ absolute
 [5] │ accept
 [6] │ account
 [7] │ achieve
 [8] │ across
 [9] │ act
[10] │ active
[11] │ actual
[12] │ add
[13] │ address
[14] │ admit
[15] │ advertise
[16] │ affect
[17] │ afford
[18] │ after
[19] │ afternoon
[20] │ again
... and 954 more
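The exercise asks for a regular expression, and part b can also be answered with one directly. As a sketch, a negated character class anchored at the start, ^[^y], matches any word whose first character is not y. The comparison with str_subset(..., negate = TRUE) is only there as a sanity check, and the variable names are illustrative.

```r
library(stringr)

# Pure-regex version: first character is anything except "y"
not_y <- str_subset(words, "^[^y]")

# Same result by negating the "starts with y" pattern
via_negate <- str_subset(words, "^y", negate = TRUE)

both_agree <- identical(not_y, via_negate)
length(not_y)                    # 974, matching the output above
```

Note that ^[^y] requires at least one character, so it would skip empty strings; that makes no difference here because every entry in words is non-empty.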
# c. End with "x".
str_view(words, "x$")
[108] │ bo<x>
[747] │ se<x>
[772] │ si<x>
[841] │ ta<x>
# d. Are exactly three letters long. (Don't cheat by using str_length()!)
str_view(words, "^[a-z]{3}$")
  [9] │ <act>
 [12] │ <add>
 [22] │ <age>
 [24] │ <ago>
 [26] │ <air>
 [27] │ <all>
 [38] │ <and>
 [41] │ <any>
 [51] │ <arm>
 [54] │ <art>
 [56] │ <ask>
 [68] │ <bad>
 [69] │ <bag>
 [73] │ <bar>
 [82] │ <bed>
 [89] │ <bet>
 [91] │ <big>
 [94] │ <bit>
[108] │ <box>
[109] │ <boy>
... and 90 more
# e. Have seven letters or more.
str_view(words, "^[a-z]{7,}$")
 [4] │ <absolute>
 [6] │ <account>
 [7] │ <achieve>
[13] │ <address>
[15] │ <advertise>
[19] │ <afternoon>
[21] │ <against>
[31] │ <already>
[32] │ <alright>
[34] │ <although>
[36] │ <america>
[39] │ <another>
[43] │ <apparent>
[46] │ <appoint>
[47] │ <approach>
[48] │ <appropriate>
[53] │ <arrange>
[57] │ <associate>
[61] │ <authority>
[62] │ <available>
... and 198 more
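The length patterns in parts d and e can be double-checked against str_length(), used here only for verification and not as the answer. The sketch below also tries the dot-based variants ^.{3}$ and ^.{7,}$, which count any character instead of only lowercase letters. Treat it as a sanity check under the assumption that words contains plain single-line strings.

```r
library(stringr)

# Patterns from parts d and e
three <- str_subset(words, "^[a-z]{3}$")
seven <- str_subset(words, "^[a-z]{7,}$")

# Every match has the expected length
three_ok <- all(str_length(three) == 3)
seven_ok <- all(str_length(seven) >= 7)

# Dot-based variants agree exactly with a length filter
dot_three_ok <- identical(str_subset(words, "^.{3}$"),
                          words[str_length(words) == 3])
dot_seven_ok <- identical(str_subset(words, "^.{7,}$"),
                          words[str_length(words) >= 7])
```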
# f. Contain a vowel-consonant pair.
str_view(words, "[aeiou][^aeiou]")
 [2] │ <ab>le
 [3] │ <ab>o<ut>
 [4] │ <ab>s<ol><ut>e
 [5] │ <ac>c<ep>t
 [6] │ <ac>co<un>t
 [7] │ <ac>hi<ev>e
 [8] │ <ac>r<os>s
 [9] │ <ac>t
[10] │ <ac>t<iv>e
[11] │ <ac>tu<al>
[12] │ <ad>d
[13] │ <ad>dr<es>s
[14] │ <ad>m<it>
[15] │ <ad>v<er>t<is>e
[16] │ <af>f<ec>t
[17] │ <af>f<or>d
[18] │ <af>t<er>
[19] │ <af>t<er>no<on>
[20] │ <ag>a<in>
[21] │ <ag>a<in>st
... and 924 more
# g. Contain at least two vowel-consonant pairs in a row.
str_view(words, "([aeiou][^aeiou]){2,}")
  [4] │ abs<olut>e
 [23] │ <agen>t
 [30] │ <alon>g
 [36] │ <americ>a
 [39] │ <anot>her
 [42] │ <apar>t
 [43] │ app<aren>t
 [61] │ auth<orit>y
 [62] │ ava<ilab>le
 [63] │ <awar>e
 [64] │ <away>
 [70] │ b<alan>ce
 [75] │ b<asis>
 [81] │ b<ecom>e
 [83] │ b<efor>e
 [84] │ b<egin>
 [85] │ b<ehin>d
 [87] │ b<enefit>
[119] │ b<usines>s
[143] │ ch<arac>ter
... and 149 more
# h. Only consist of repeated vowel-consonant pairs.
str_view(words, "^([aeiou][^aeiou])\\1+$") # No matches in words, so let's test the pattern on made-up strings

# Test with known patterns
pattern <- "^([aeiou][^aeiou])\\1+$"
pos <- c("anananan", "erer")
neg <- c("nananana", "erere", "ananerer")

str_view(pos, pattern)
[1] │ <anananan>
[2] │ <erer>
str_view(neg, pattern)
# It seems to work: there are no words with this pattern in stringr::words
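Part h can be read in two ways. The pattern above uses the backreference \1 (written \\1 in the R string), so it requires the same pair to repeat. If the question is read as "any sequence of vowel-consonant pairs", a plain repetition without the backreference fits better. The sketch below compares both readings on made-up strings; the names same_pair and any_pairs are illustrative.

```r
library(stringr)

same_pair <- "^([aeiou][^aeiou])\\1+$"   # one pair, repeated identically
any_pairs <- "^([aeiou][^aeiou])+$"      # one or more pairs, possibly different

test <- c("anananan", "ananerer", "nananana", "erere")

strict <- str_detect(test, same_pair)    # TRUE FALSE FALSE FALSE
loose  <- str_detect(test, any_pairs)    # TRUE TRUE FALSE FALSE

# Under the looser reading, stringr::words does contain matches, e.g. "away"
away_matches <- str_detect("away", any_pairs)
```

Running str_subset(words, any_pairs) lists every word that fits the looser reading.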

With these exercises, we’ve demonstrated how regular expressions can be used to manipulate and search text data in R. Whether you are searching for specific patterns or validating text data, regex provides a robust solution for your string processing needs.

See you soon in Part II of these exercises.

Happy regexing!

References

  • R for Data Science (2ed), written by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. https://r4ds.hadley.nz/

  • Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.