
It’s All Greek to Me: Creating My Own Regex Writer

Link to the code on GitHub: utils_regex.R

 

Executive Summary:

I developed a library of trivial but useful regex-writing functions that make normally painful expressions faster to write and easier to read. I also expanded the suite of typical regex functions to include others I wished had existed all along, mostly to cut the boilerplate code that comes with certain types of expressions. I like using these functions because they make regex faster to write, easier to read, and much simpler to debug.

 

Motivations:

Regular expressions often look like chicken scratch to programmers who didn’t write them. After working with them frequently, I find them relatively straightforward to write but still painful to read and understand. I created this suite of functions, which build up regular expressions in easy-to-understand blocks, so that other programmers who look at my code (including future me) can easily understand what I was getting at with these expressions and how.

 

To start, why is there no simple regex remover function? Sure, you can write re.sub with repl equal to the empty string (gsub(replacement = "") for the R programmers), but why all the boilerplate? Also, why is the pattern always the first argument, when putting the strings it acts on first would make more sense (especially given R’s pipe operator)? Well…

 

rem(strings, pattern, …) is a single substitution with an empty string: like sub, it removes only the first match in each string. grem() is the gsub version of that, removing every match.
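
A minimal Python sketch of the same idea, for readers coming from re.sub. The names rem and grem come from the post; the bodies and the flags parameter are assumptions based on the description above, and the actual R versions in utils_regex.R may differ.

```python
import re

def rem(strings, pattern, flags=0):
    """Remove only the first match of `pattern` from each string (like R's sub with "")."""
    return [re.sub(pattern, "", s, count=1, flags=flags) for s in strings]

def grem(strings, pattern, flags=0):
    """Remove every match of `pattern` from each string (like R's gsub with "")."""
    return [re.sub(pattern, "", s, flags=flags) for s in strings]

print(rem(["a1b2", "c3"], r"\d"))   # ['ab2', 'c']
print(grem(["a1b2", "c3"], r"\d"))  # ['ab', 'c']
```

Note that the strings come first and the pattern second, which is the argument order the post argues for.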

 

If I want to remove multiple things from a list or vector of strings, or do multiple substitutions on it, do I really have to chain the expressions together (re.sub(re.sub(re.sub(re.sub(to infinity and beyond!))))) until the stack overflows? Or worse yet, copy-paste nearly the same line many times in a row, with a new or identical variable name each time? Nope.

 

grems(), subs(), gsubs(), greps(), grepls(), regexprs(), and gregexprs() (the "s" just indicates the plural form) do exactly that, but with a built-in for loop to further reduce boilerplate your eyes don’t need when you’re already looking at regex. subs() and gsubs() have the added benefit of taking a single named vector in R, where each name is the replacement and each value is the pattern, so "USA" = "United States" would turn "United States" into "USA". If you’re starting with two separate vectors, just name the patterns with the replacements.
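
As a rough Python sketch of the plural helpers: the function names follow the post, but the bodies are assumptions. A dict stands in for R's named vector, with keys as replacements and values as patterns, mirroring the "USA" = "United States" convention described above.

```python
import re

def grems(strings, patterns):
    """Remove every match of each pattern, applying the patterns in order."""
    for pattern in patterns:
        strings = [re.sub(pattern, "", s) for s in strings]
    return strings

def gsubs(strings, replacements):
    """Apply several global substitutions in order.

    `replacements` maps replacement -> pattern (assumed to mirror the
    named-vector convention in the post, not taken from the R source).
    """
    for replacement, pattern in replacements.items():
        strings = [re.sub(pattern, replacement, s) for s in strings]
    return strings

print(grems(["a-1_b"], [r"\d", r"[-_]"]))  # ['ab']
print(gsubs(["United States and United Kingdom"],
            {"USA": "United States", "UK": "United Kingdom"}))  # ['USA and UK']
```

Because the substitutions run in sequence, a later pattern sees the output of the earlier ones, so order matters when patterns overlap.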

 

Do you have a set, list, or vector of expressions you’d like to test all at once? Just wrap it inside any_of(), which will make the "(x|y|z)"-like construction for you. It’s most useful when you have multiple nested or-bars.
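
A small Python sketch of what any_of() might look like, assuming it simply joins its inputs with or-bars inside one group; the real R function may handle more cases.

```python
import re

def any_of(patterns):
    """Join patterns into a single alternation group such as "(x|y|z)"."""
    return "(" + "|".join(patterns) + ")"

print(any_of(["x", "y", "z"]))  # (x|y|z)

# Nesting reads more easily than writing the or-bars by hand.
title = any_of(["Mr", "Mrs", any_of(["Dr", "Prof"]) + r"\."])
print(title)  # (Mr|Mrs|(Dr|Prof)\.)
print(bool(re.search(title, "Prof. Plum")))  # True
```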

 

Does finding a word need to be as ugly as "\\bword\\b"? I’ve lost count of the number of times I (or an error message) caught that I had written "\\bob\\b" when I meant "\\bbob\\b" (the word bob), for instance. word("bob") does that for you.
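
In Python the equivalent helper could be as short as the sketch below. It assumes word() only wraps its argument in word boundaries and does not escape it, so the argument can itself be a regex.

```python
import re

def word(w):
    """Wrap a pattern in word boundaries so it only matches whole words."""
    return r"\b" + w + r"\b"

pattern = word("bob")
print(pattern)  # \bbob\b
print(re.findall(pattern, "bob bobby kabob bob."))  # ['bob', 'bob']
```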

 

If you’re removing certain words, you’ll often end up with hanging punctuation that’s painful to remove. Why not combine all that into one step?
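
The post does not name the function that does this, so the Python sketch below uses a hypothetical remove_words() and makes one particular assumption about "hanging punctuation": a comma, semicolon, or colon directly after the removed word goes with it, along with the whitespace before the word.

```python
import re

def remove_words(strings, words):
    """Hypothetical helper: remove whole words plus punctuation left hanging after them."""
    alternation = "|".join(re.escape(w) for w in words)
    # Leading whitespace, the word itself, then any trailing , ; or :
    pattern = r"\s*\b(?:" + alternation + r")\b[,;:]*"
    return [re.sub(pattern, "", s).strip() for s in strings]

print(remove_words(["apples, um, oranges, like, pears"], ["um", "like"]))
# ['apples, oranges, pears']
```

Other conventions are just as reasonable (for example, removing a comma before the word instead), so treat the punctuation rule here as a placeholder.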

 

Removing everything that occurs before or after (but not including) some highly repetitive set of characters can sometimes cause catastrophic backtracking and related problems, so I’ve also created some functions that make that process easier and faster. They swap the one-line sub you’re (or I’m) liable to write on a deadline for a few more careful lines, while keeping a clean, unobtrusive appearance.
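
One way to sketch this in Python is to search for the marker once and slice the string, rather than asking the regex engine to match all the surrounding text with something like ".*" plus a lookahead. The names remove_before() and remove_after() are hypothetical, since the post does not name these functions, and the R versions may take a different approach.

```python
import re

def remove_before(strings, pattern):
    """Drop everything before the first match of `pattern`, keeping the match itself."""
    out = []
    for s in strings:
        m = re.search(pattern, s)
        out.append(s[m.start():] if m else s)  # unchanged if no match
    return out

def remove_after(strings, pattern):
    """Drop everything after the first match of `pattern`, keeping the match itself."""
    out = []
    for s in strings:
        m = re.search(pattern, s)
        out.append(s[:m.end()] if m else s)  # unchanged if no match
    return out

print(remove_before(["aaaa=key=value"], "="))  # ['=key=value']
print(remove_after(["aaaa=key=value"], "="))   # ['aaaa=']
```

The call site stays a one-liner, while the careful part lives inside the helper.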