
Transcribing R to idiomatic Python: Code that Edits Code

Links to code on Github: code transcriber and resulting Python code

 

Executive Summary:

I wrote code that transcribes my (native) R lemmatizer into Python for wider use. As the lemmatizer performs the same few basic operations many times, I was able to teach the transcriber how to transcribe each type of statement into idiomatic (Pythonic) Python, making it usable, readable, and efficient in both languages. In this way I’ve halved my need to update and debug code, saving myself much time and many headaches.

 

Motivations:

After developing my lemmatizer in R, I wanted to use it in conjunction with other NLP packages in Python. Keeping in mind that I'd often make small changes to my lemmatizer code in R, I wanted some way to easily and safely generate and maintain the Python cousin of that code. Working in multiple languages is a part of many team projects, and I wanted to get into the habit of never duplicating work, since writing the same programming idea in two languages from scratch is a waste of resources.

 

The idea is simple to conceive but (as I found) somewhat difficult to execute thoroughly, especially in a way that is robust to unforeseen changes. Some language differences (R's use of brackets and default two-space indentation vs Python's use of colons and four-space indentation) are relatively straightforward to handle with a few well-tuned regular expressions (regex). Other aspects are much more difficult to handle properly: among these are Python's basis on lists vs R's basis on vectors, handling missingness across two languages that differ widely in how missingness is encoded, translating carefully-tuned string operations into Pythonic Python (both for the sake of readability and performance), and transcribing statements written over more than one line. I'll address these in order below.
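
As a hypothetical, simplified illustration of such a rule (the real transcriber chains many rules and guards against quoted strings and comments):

    # Sketch only: widen a 2-space indent to 4 spaces, then turn "if (...) {" into "if ...:"
    r_if_to_python <- function(line) {
      indent <- regmatches(line, regexpr("^ *", line))
      line   <- sub("^ *", strrep(" ", 2 * nchar(indent)), line)   # double the indentation
      sub("^(\\s*)if \\((.+)\\) \\{\\s*$", "\\1if \\2:", line)     # brace header -> colon header
    }
    r_if_to_python("  if (is.na(word)) {")
    # [1] "    if is.na(word):"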

 

Python's lists allow a kind of freedom that R's vectors don't have, such as holding multiple data types and allowing nesting (similar to R's lists in these respects). The better analogue to R's vectors in Python is numpy's arrays, and I naturally made extensive use of these for efficient vectorized calculations on input words. However, for convenience to the user (including myself) I made the Python version accept the most common structure as input, which for Python is still the list, handled with one simple customized list comprehension ensuring type safety and flatness. R makes extensive use of its combine c() function for creating vectors, which has the convenient property of always producing a flat result. I had to use a few additional customized bits of code to ensure flatness after some of the vector combinations I had written in R.
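
A quick illustration of that flattening difference:

    c(1, c(2, 3), 4)         # [1] 1 2 3 4   -- c() always returns a flat vector
    list(1, list(2, 3), 4)   # keeps the nested structure, much like a Python list can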

 

Numpy strings do not natively support missingness. Where my R code used boolean missingness (NA) to denote values that are neither true nor false (e.g., some words like sheep are neither plural nor singular), I worked around this by simply treating words as singular if they do not need to be made singular for the lemmatizer to work. It would be possible to keep a separate boolean numpy array for missingness, but that seemed excessive when a simple solution like the above worked fine (after all, the complicated solution would not be very Pythonic).

 

Translating idiomatic R (such as that with extensive use of infix operators) into idiomatic Python (list comprehensions, tuples, and such) required a more fine-tuned approach, and such transcriptions needed to be baked into the transcriber due to their extensive use. I translated most of R’s paste() statements into list comprehensions with simple string concatenation. I found that while Python strings conveniently support using a tuple as the argument in the endswith() method (to check if the string ends with multiple endings), numpy.chararray only supports single strings (and numpy.ndarray does not support the endswith() method at all), also necessitating the creation of more list comprehensions. I had to take many things into account when moving arguments around (e.g. from either infix operator position in R to the object-like position in Python), including parentheses, operators and their order of operations, and non-coding character strings (either in literal, quoted strings or comments).
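
As a hypothetical, simplified example of one such baked-in rule, here is roughly how a bare paste0(vector, "suffix") call could be rewritten into a list comprehension in the output text (other rules would handle assignment arrows, separators, and multiple arguments):

    # Sketch only: rewrite paste0(vector, "suffix") as a Python list comprehension string.
    paste0_to_listcomp <- function(stmt) {
      gsub('paste0\\((\\w+), "(\\w+)"\\)', '[w + "\\2" for w in \\1]', stmt)
    }
    paste0_to_listcomp('plurals <- paste0(stems, "s")')
    # [1] "plurals <- [w + \"s\" for w in stems]"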

 

Additionally, I taught the transcriber how to evaluate arguments on separate lines to handle my longer, complicated vectorized logical computations. Adding the trailing backslash for Python's line continuation was easy; identifying whole argument chunks and moving them around appropriately was not.

 

After running the code transcriber, the output Python code only requires a handful of manual changes—which can be copy-pasted identically each time as they are the most fundamental and language-dependent processes and least likely to change—to function identically to its R counterpart.

 

Future developments could further reduce the need for the remaining few manual changes, though some are likely infeasible (or prohibitively burdensome) to handle in Python (e.g. non-standard evaluation, which is very idiomatic and unique to R). The pluralization functions are readily transcribable as well once I remove their dependency on non-standard evaluation, which is a convenient but not essential aspect of their function in R. I’d also like to further improve the readability of the code in ways that don’t significantly impact performance, and plan to explore subclassing numpy.chararray to clear up some boilerplate verbiage among the methods (especially when it comes to testing element-wise if a string matches any of a few targets).

How to Process Textual Data Concisely, Efficiently, and Robustly in R

Links to code on Github: utils_factors.R

 

Executive Summary:

I created a library of efficient and robust level-editing functions in R that allow for elegant formatting and handling of string data in the form of factors. I created these out of need for cleaning textual data for my Shiny project, but have found them so useful that I regularly use them on textual data in other projects as well.

 

Motivations:

The impetus for creating this library of functions was how inefficient some factor-handling processes are in base R (and even with some useful packages designed for this work, such as Hadley’s forcats package). (For the Pythonians among us, R’s factors are analogous to pandas’ categoricals, and the levels of a factor (a vector of type factor) are the different categories that values can take.) With the need to format tens of columns of millions of rows of strings within a reasonable time frame for iterative improvement of the code I was writing, I simply couldn’t afford to wait half an hour every time I wanted to test if my new data pipeline changes had been successful or not. So I sought out ways to do more efficient processing of factors in R. At this point, I had already discovered the biggest bottleneck in my textual processing pipeline, which had been neglecting to use factors in the first place (instead, processing each unique string as many times as it appeared in the dataset). But I looked deeper and found a programming quandary begging to be solved. And beyond its relationship to factor handling in R, what I discovered has reshaped my approach to processing textual data in general.

 

Initially I had simply tested the new pipeline on small aliquots of data (say, 1% of the total), which helped, but strings, unlike numerical data, are less predictable in how they will respond to processing (a function may show issues for classes of numbers, such as negative numbers, numbers close to 0, or very large numbers, but there are simply too many different kinds of character strings). Furthermore, processing only 1% at a time limited my attempted fixes to roughly 1% of the errors at a time, as many of the errors were unique. Profiling didn't lead me anywhere interesting either, as it only showed that some of the base R operations were slowing things down. Surely you can't make things any faster without digging into C/C++, right?

 

Well, yes and no. I realized data.table's speed improvements over many base R or Hadley-verse equivalents (dplyr and the like) come not only from being written in C but also from setting values by reference. That is, instead of forming a brand new dataframe in memory each time it's modified, data.table just modifies the original copy already in memory. R's typical copy-on-modify semantics are exactly what you'd want for exploratory data analysis, where corrupted or lost data from one accidentally imperfect exploratory query is the last thing you'd want. But the memory and processing overhead of making a copy every single time data are modified is a high price to pay in the context of a data pipeline (just make one copy first thing and forget about it until you want to process everything from scratch again). I did some tests and found that R's slow factor handling was due not to the cardinality of the data (i.e. actually editing the factor levels as I was) but to the sheer size of the data (i.e. copying the data before each successive modification).
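
The difference in miniature (data.table's setattr() edits an attribute in place without copying; the data here are just illustrative):

    library(data.table)
    dt <- data.table(country = factor(c("usa", "uk", "usa")))
    # Copy-on-modify: the usual way, which can trigger a copy behind the scenes
    levels(dt$country) <- toupper(levels(dt$country))
    # Set by reference: edit the levels attribute of the existing column in place
    setattr(dt$country, "levels", tolower(levels(dt$country)))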

 

So I set about designing a set of R functions that would facilitate directly modifying factor levels by reference using data.table's ability to directly edit column attributes. And aside from simply accessing the inner machinery of the data tables, I saw it as an opportunity to build a fully developed and idiomatic library centered around efficient factor level modification. A few of the design features I worked up to over time are uniformity in structure and use across the set of functions, placement of the vector of factors as the first argument to allow use with R's pipe operator (%>%), and invisible return of the edited factor vector to allow chained piping for an intuitive relationship between code structure and function. Below I'll review what each function does.
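
As a rough sketch of the pattern that results (not the actual implementation in utils_factors.R, which also handles the base-R fallback and edge cases):

    format_levels <- function(fact, func, ...) {
      data.table::setattr(fact, "levels", func(levels(fact), ...))  # edit the levels in place
      invisible(fact)   # invisible return keeps %>% chains quiet but pipeable
    }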

 

format_levels(fact, func, …)

Replaces the factor levels with those same levels as modified by a function. Very useful for formatting text, like capitalizing entries. Now intuitive code like things_to_capitalize %>% format_levels(capitalize) is paired with extremely fast performance.

 

format_similar_levels(fact, pairings, …)

Same as the above, but processes the levels with an entire set of functions, each paired with a regex pattern determining which specific levels get altered by which function. Say you want to capitalize only certain entries based on their content: entry_data %>% format_similar_levels("^association" = capitalize) would capitalize all levels starting with "association".

 

replace_levels(fact, from, to)

Sometimes you'd just like to replace one (or several) levels with a specific new value, e.g. countries %>% replace_levels(from = "PRC", to = "China").

 

rename_levels(fact, changes)

Same as the above, but using a named vector instead, so the example would be countries %>% rename_levels("China" = "PRC"). The new values are the names of the changes vector, so any unnamed entries get dropped to the empty string: countries %>% rename_levels("unknown") renames the "unknown" level to "".

 

rename_similar_levels(fact, changes, exact)

Same as the above, but using regex instead, so countries %>% rename_similar_levels("Un." = "^United ") would abbreviate all countries starting with "United" to "Un.".

 

add_levels(fact, add)

Initializes a (currently unrepresented) level. For example, responses %>% add_levels("wants Python to start using brackets") would allow a new bar of height 0 to be shown for the number of people who want Python to start using braces/brackets like everyone else.

 

drop_levels(fact, drop, to)

Combines the given (probably unimportant) levels into some other category (default: the empty string). For example, data %>% drop_levels(c("unk.", "unspecified"), to = "unknown") turns both the "unk." and "unspecified" levels into "unknown".

 

drop_similar_levels(fact, drop, to, exact)

Same as above, but matching with regex via a named vector, as before.

 

drop_missing_levels(fact, to)

Combines all unrepresented levels into one (default empty string) level.

 

keep_levels(fact, keep, to)

Drops/combines all levels except for those specified in keep.

 

keep_similar_levels(fact, keep, to, exact)

Same as above, but matching with regex via a named vector, as before.

 

reduce_levels(fact, rules, other, exact)

Decides which levels to drop/combine/otherize to the string specified in other (default "other") based on regex rules.

 

otherize_levels_rank(fact, cutoff, other, otherize_empty_levels, include_ties)

Decides which levels to drop based on which ones are represented below a certain cutoff in rank (of frequency). Exact function behavior can be modified using the last two arguments, which are booleans.

 

otherize_levels_prop(fact, cutoff, other, otherize_empty_levels)

Same as above, except based on a proportion as cutoff (e.g. combine all levels that individually account for less than 1% of values).

 

otherize_levels_count(fact, cutoff, other, otherize_empty_levels)

Same as above, except based on a hard number cutoff.
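
For example (assuming utils_factors.R is sourced and a pipe such as magrittr's %>% is attached), lumping every rare level into "other" becomes a one-liner:

    browsers <- factor(c(rep("chrome", 5), rep("firefox", 4), "netscape"))
    browsers %>% otherize_levels_prop(cutoff = 0.2)   # "netscape" (10% of rows) becomes "other"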

 

And then there is a set of functions with the same names but ending in _by_col, which, instead of taking a single factor column vector, take an entire data.table and apply the corresponding function to each column (or to a named subset of columns).

 

This set of factor-editing functions does not technically have data.table as a dependency: if you don't have data.table installed, it finds a way to do everything in base R, since the functions provide a convenient interface for working with factors regardless of the speed improvements. It is much faster, however, when it can use data.table's set-by-reference machinery.

 

There's one more, more complicated function, which does require data.table: it carefully fills in missing data across two columns that are redundant or otherwise related (e.g. a country code and a country name). I initially had it in mind just for my Shiny project data's code:value redundant structure, but I found I could also expand it to validate and fill in city:country (and other similar) relationships as well, as any one-to-one or many-to-one relationship will work. I could write an entire new article on this function alone, though, so I will cut things off here.

 

If you do any work with factors in R on large datasets, I'd encourage you to see what kind of performance (and simplicity) improvements you can achieve with data.table paired with this package.

Don’t Know Much About History: Visualizing the Scale of Major 20th Century Conflicts (Details)

Check out the code here while you read the article.

 

Executive Summary:

I used advanced programming features of R to make cleaning and organizing the data possible, especially in an efficient and highly iterative manner. I further used closures, function factories, and multiple layers of abstraction to make the code more robust to changes/refactorization and much easier to understand.

 

A general overview of the project is available here. (separate blog post)

 

This project was a challenge of programming fundamentals: execute a complicated task reliably in minimal time with readable code. Along the way I discovered some of the essential features in R—including R’s superfast data.table, unit testing methods and other features from the “Hadleyverse”/tidyverse, and built-in functional programming aspects—and wrote other features myself, such as efficient string editing through factors and conveniences allowed by R’s “non-standard evaluation” (for more reference on this see Hadley Wickham’s Advanced R).

 

First, an ode to data.table: it’s arguably the fastest data wrangling package out there, at least among the top data science languages like Python and R. It’s ~2x faster at common tasks than Python’s Pandas (and unspeakably faster than R’s native data.frame). I also like to think it has better syntax, as it’s written more like a query than a series of method calls. It allows value setting by reference, which prevents copying of the whole dataset just for slight modifications. It provides both imperative := and functional `:=`() forms, and even a for loop-specific form set() for those few times when one-line solutions aren’t enough. It allows easy control of output format (whether in vector or data.table form) and provides the convenience of non-standard evaluation without removing the flexibility of standard evaluative forms. It is tremendously flexible, expressive, powerful, and concise, which makes it an excellent choice for any data wrangling project. It really shines with larger (≥1GB) datasets, especially with its parallelized read and write methods, and enabled me to iterate quickly when designing my data cleaning pipeline for edge cases in the entire datasets I set out to process.
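
A small taste of those three forms (column names invented for illustration):

    library(data.table)
    dt <- data.table(mission = c("A", "B"), tons = c(2.5, 10))
    dt[, tons_kg := tons * 907.185]               # imperative := adds a column by reference
    dt[, `:=`(heavy = tons > 5)]                  # functional `:=`() form
    set(dt, j = "tons", value = round(dt$tons))   # set(): the loop-friendly form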

 

Functional programming was essential to staying DRY (Don't Repeat Yourself—minimizing duplication) in this project. With four similar datasets that were best kept separate, I created function factories that produce closures to process each dataset as appropriate, and further bundled the closures together so one could essentially call a function on any dataset and get the expected results, even though the underlying implementation may differ from dataset to dataset. I wrote this program with a scalable design, as I had initially planned to start with just the World War Two data and, once that was working, expand to the other datasets you can see in the final product. If my data source releases any more datasets, I'll easily be able to fit them into the current framework as well.
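
In sketch form (the second column name below is hypothetical, and the real factories bundle more per-dataset logic):

    library(data.table)
    # A function factory: each call manufactures a closure that remembers which
    # date column its own dataset uses.
    make_date_cleaner <- function(date_col) {
      function(dt) {
        dt[, (date_col) := as.IDate(get(date_col))]   # parse that dataset's date field by reference
        invisible(dt)
      }
    }
    clean_ww2   <- make_date_cleaner("Mission_Date")
    clean_other <- make_date_cleaner("Msn_Date")      # hypothetical column name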

 

The size of the dataset and the intricacy of the code provided an opportunity for automated unit testing to catch issues early. As I had to iterate through various designs of the data cleaning code many times to eliminate each separate issue I noticed, unit testing caught any unintended side-effects as soon as they happened and ensured that obscure data cleanliness issues didn’t rear their heads many steps later. Hadley’s testthat package proved very useful for this.
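
A representative test in that style (clean_country() below is a toy stand-in, not one of my actual helpers):

    library(testthat)
    clean_country <- function(x) tools::toTitleCase(tolower(trimws(x)))   # toy stand-in
    test_that("country names come out trimmed and title-cased", {
      expect_equal(clean_country(c(" united STATES ", "japan")),
                   c("United States", "Japan"))
    })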

 

The raw string data arrived in an inconsistent and hideous format. In order for tooltips to appear well-formatted to a human reader, I made the textual data conform to standard conventions of punctuation and capitalization. The extensive presence of proper nouns in particular required a fine-tuned approach that simple regular expressions couldn't solve, so I created string formatting methods that I sought to apply to each row in certain columns. However, initial versions of my string formatting code, though successful, took several hours (half a day) to run, requiring that I rethink the programming and computer science aspects of my approach.

Some vectorized (or easily vectorizable) methods only took a few minutes each to peel through a column of my largest dataset with over 4.5 million observations. So I figured out a way to effectively vectorize even the seemingly most unvectorizable methods. For instance, instead of splitting each row's text into words and rerunning the word-vectorized proper noun capitalization method on all rows in the dataset, I discovered it was significantly more efficient (in R) to unlist the list of words in each row into a single large character vector (containing all words in all rows, one after another, with no special separation between the words of one row and the words of the next), run the proper noun capitalization on that single long character vector, and then repartition the (now capitalized) words back into a list of words for each row. This single change improved the efficiency of the data formatting by nearly a factor of 10.

Still, it was taking over an hour to run, and I investigated further ways to improve efficiency. I realized in particular that for the Unit_Country field I was editing the same ~5 strings in the same way over and over again for each of over 4.5 million rows, which was horribly inefficient. Much better would be to map each row's string to a short table of possible string values, apply the edits to the reference table, and then unmap or reference the updated values as necessary. This is (more or less) exactly what factors do! Applying the same string formatting methods to the factor levels brought the total runtime down to a few minutes, which made fast iteration through data cleaning method variants easy.

Even so, the methods seemed to take unusually long (over one second per operation) to update just a few string values, so I investigated further and discovered that R was copying the entire dataset each time a factor's levels were altered (even using Hadley's forcats package). I then found a way to update a factor's levels in place by reference (thanks again to data.table), which brought the total runtime down to less than a minute, for a total efficiency improvement of over 1,000x. This efficiency improvement was essential for using iterative design to get the labels just right. Extensive use of the pipe (%>%) made the code substantially clearer to write and substantially easier to read, often obviating comments.
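
The unlist-then-repartition trick above looks roughly like this (capitalize_proper_nouns() and the column are toy stand-ins):

    library(data.table)
    capitalize_proper_nouns <- function(w) ifelse(w == "berlin", "Berlin", w)  # toy stand-in
    dt <- data.table(target_city = c("north berlin", "berlin suburbs"))
    words <- strsplit(dt$target_city, " ", fixed = TRUE)   # list of word vectors, one per row
    flat  <- capitalize_proper_nouns(unlist(words))        # one vectorized pass over every word
    dt[, target_city := vapply(relist(flat, words),        # repartition back into rows
                               paste, character(1), collapse = " ")]
    dt$target_city   # [1] "north Berlin"   "Berlin suburbs"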

 

Additionally, I created a reusable fill_matching_values() method that fixes inconsistencies in one-to-one code-to-value mappings such as those found in the datasets for this project. Many of the values appeared as though they had been entered through Optical Character Recognition (OCR) technology, and there were plenty of labeling mistakes that needed fixing. The algorithm identifies the most common mapped value for each code and replaces all mapped values with that presumably correct modal (most common) value; analogously, the algorithm then fills in missing codes and corrects incorrect ones using the most common code that appears for each value. I first coded up a working form of this algorithm using basic data.table syntax and then vastly improved its efficiency (~10-fold) by implementing the lookup table using data.table's more advanced, speed-optimized join syntax. Furthermore, I easily filtered the intermediate lookup table for duplicate or conflicting mappings, allowing my human eyes to double-check only the mistaken mappings and their corrections. (And I'm glad I did: I found a few misspellings that were more common than the correct versions, which were easily fixed manually in my data cleaning pipeline.)
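
The heart of the modal-value step can be written in a few lines of data.table (a sketch with generic code/value columns, not the full fill_matching_values()):

    library(data.table)
    dt <- data.table(code  = c("JP", "JP", "JP", "US"),
                     value = c("Japan", "Japan", "Jopan", NA))
    # Most common value observed for each code...
    modal <- dt[!is.na(value), .N, by = .(code, value)][order(-N), .SD[1], by = code]
    # ...then overwrite every row's value via an update join on code.
    dt[modal, value := i.value, on = "code"]   # every "JP" row becomes "Japan"; "US" stays NA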

 

Stylistically, the code still had a long way to go. I found myself repeatedly performing the same few operations on each column’s factor levels: dropping levels, renaming levels based on a formula, renaming specific levels with specific new names, renaming levels through regular expressions, grouping rare levels together under an “other” label, and running the same level formatting on multiple different columns. I created helper functions, almost like a separate package, for each of these situations. The formula-based level renaming function is an example of a functional: pass it a function, and it gives you back values (the edited levels). Different operations could be applied to the same column/vector by invisibly returning the results of each operation and using chaining.

 

Creating compositions of data cleaning functions greatly simplified the writing process and improved the readability of my code. I defined a functional composition pipe %.% (a period, or “dot”) mimicking the syntax in fully functional programming languages like Haskell, but found that my reverse functional composition pipe %,% (a comma) made the code even more readable. (Instead of f() %.% g(), representing f(g()), which places the first-executed operation last on the line, I sided with f() %,% g(), representing g(f()), which has the functions execute in the same order in which they appear in the line of code, separated by commas, as if to suggest “first f(), then g()”.)
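
A stripped-down sketch of the reverse-composition idea, using bare functions rather than the quoted-call form the real operator supports:

    `%,%` <- function(f, g) function(...) g(f(...))   # "first f, then g"
    squish <- trimws %,% tolower
    squish("  Berlin ")   # [1] "berlin"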

 

Other infix functions simplified common tasks in R:

%||% provides a safe default (on the right) when the item on the left is null.

%[==]% serves a dual purpose: when a[a == b] is intended, it prevents writing a textually long expression a twice and/or prevents unnecessary double evaluation of a computationally expensive expression a (this and %||% are sketched just after this list).

The related %[!=]%, %[>]%, %[<]%, %[>=]%, %[<=]% are self-explanatory given the above.

%[]% analogously simplifies the expression of a[a %>% b], using non-standard evaluation and proper control of calling frames (the environment in which to evaluate the expression).

%c% (the c looks like the set theory “is an element of” symbol) creates a safe version of the %in% operator that assumes that the left-side component is a scalar (vector of length 1), warning and using only the first element of a longer vector if otherwise.
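
Minimal sketches of the first two of these operators, %||% and %[==]% (the versions in the repo likely handle more edge cases):

    `%||%`   <- function(a, b) if (is.null(a)) b else a   # safe default on the right
    `%[==]%` <- function(a, b) a[a == b]                  # the argument a is evaluated only once
    list(max_zoom = NULL)$max_zoom %||% 10   # [1] 10
    c(3, 7, 7, 2) %[==]% 7                   # [1] 7 7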

 

I also fixed a subtle bug in the base R tabulate() method as applied to factor vectors: when the last level (or last few levels) is not represented in the vector, it completely drops them from its results. Under the hood, the default tabulate method counts factor values from the first level up to the highest level index actually present (indices 1 through max(as.integer(x))), so it reports unrepresented levels only when they're not last in the list of levels.
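
To see the problem concretely (the nbins argument shown here is one standard workaround, not necessarily the fix I used):

    f <- factor(c("a", "b"), levels = c("a", "b", "c"))
    tabulate(f)                      # [1] 1 1     -- trailing level "c" silently dropped
    tabulate(f, nbins = nlevels(f))  # [1] 1 1 0   -- every level reported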

 

I explored using parallel processing for the multicolumn data.table operations using mclapply() instead of lapply(), but the former was slower in every instance across every possible number of execution threads. I gained a small performance benefit from using the just-in-time compiler.
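
Concretely, the comparison looked something like this (toy data; mclapply() needs a Unix-alike for mc.cores > 1):

    library(parallel)
    toy_cols <- replicate(4, rnorm(1e5), simplify = FALSE)
    # The drop-in parallel form of lapply()...
    res <- mclapply(toy_cols, sort, mc.cores = 2)   # was slower than lapply() in every test I ran
    # ...and the byte-code JIT, which did give a small boost:
    compiler::enableJIT(3)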

 

Finally, onto the finished product:

On the main (overview) tab, I plot out aerial bombing events on a map. Extensive controls allow the user to select which data to plot, filter the data by several criteria, and alter the appearance of the map to highlight various aspects of the data. Above the map, informational boxes display summations of various salient features of the data for quick numerical comparisons between conflicts and data subsets.

 

I allow the user to filter each dataset by six criteria, providing a comprehensive exploratory approach to interacting with the data: dates (Mission_Date), the target country (Target_Country), the target type (Target_Type), the country of origin (Unit_Country), the type of aircraft flown (Aircraft_Type), and the type of weapon used (Weapon_Type). In order to allow speedy performance when filtering millions of rows on these arbitrary criteria on the fly, I set these columns as keys for each dataset, which causes data.table to sort the rows so that binary search may be used (as opposed to scanning the entire dataset). Furthermore, I wrote a structured method that queries the datasets using only the minimum criteria specified, saving time. I update the possible choices for each filtration criterion based on the unique choices (levels) that still remain in the filtered datasets. Since this latter step involves an expensive, deterministic operation that returns a small amount of data—and potentially executing the same query multiple times—it is a perfect candidate for memoization, which I performed effortlessly with the memoise package.
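
In sketch form (ww2 here is a toy table and the helper names are placeholders; the real app constructs its queries more generally):

    library(data.table)
    library(memoise)
    ww2 <- data.table(Target_Country = c("GERMANY", "FRANCE"), Unit_Country = "USA",
                      Aircraft_Type  = c("B17", "B24"))
    setkey(ww2, Target_Country, Unit_Country)        # sorted keys => binary search, not a scan
    filter_missions <- function(target, unit) ww2[.(target, unit)]   # keyed lookup
    # Deterministic, small output, likely repeated: a natural fit for memoise.
    remaining_aircraft <- memoise(function(target, unit)
      sort(unique(as.character(filter_missions(target, unit)$Aircraft_Type))))
    remaining_aircraft("GERMANY", "USA")   # [1] "B17"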

 

The maps utilize a few features to better visualize points. A few different map types are available (each better at visualizing certain features), and map labels and borders may individually be toggled on or off. If the filtered data for plotting contains too many rows, a random sample of the filtered data is used for plotting a subset of the points. The opacity of points depends logarithmically on the number currently plotted (creating a plot of more numerous and more transparent points or more sparse and more opaque points). The opacity also depends linearly upon the zoom level of the map, so points get more opaque as one zooms in. The radius marks the relative damage radius of the different bombings (an approximation based on the weight of the explosives dropped, not based on historical data). The radii stay the same absolute size on the screen—regardless of zoom level—until one zooms in far enough to see the actual approximate damage radius, which I calculated based on the weight (or equivalent weight in TNT) of the warhead(s).

 

As hinted at earlier, each data point has a corresponding tooltip that displays relevant information in an easy-to-read format. (Each tooltip appears only when its dot on the map is clicked so as to not crowd out the more important, visually plotted data.) I included links to the Wikipedia pages on the aircraft and city/area of each bombing event for further investigation. (The links, pieced together based on the standard Wikipedia page URL format and the textual name of the aircraft or cities/areas, work in the majority of cases due to Wikipedia’s extensive use of similar, alternative URLs that all link to a given page’s standard URL. The few cases in which the tooltip’s link reaches a non-existent page occur either—most commonly—when an associated page for that aircraft or city/area has not been created, or when the alternative URL does not exist or link properly.)

 

The data are also displayable in a columnar format (in the data tab) for inspection of individual points by text (sortable by the data’s most relevant attributes).

 

Each dataset has its own tab (named for the conflict that the dataset represents) that graphs salient features of the data against each other. These graphs include a histogram and a fully customizable "sandbox" plot. The same filters (in the left-most panel) that the user has chosen for the map-based plots also apply to the graphical plots on these tabs, allowing the user to investigate the same subset of data spatially (the map plot), temporally (the histogram), and relationally (the "sandbox" plot).

 

The histogram and “sandbox” plot for each dataset are generated through function factories that create the same plots (with some minor dataset-specific differences, such as choices for categorical and continuous variables) for each dataset despite their differences, providing a uniform interface and experience for disparate datasets. Furthermore, the histogram and sandbox plot are responsive to user input, creating the clearest possible graph for each choice of variables and options. For example: choosing a categorical variable for the independent/horizontal axis creates a violin plot with markings for each quartile boundary; choosing a continuous variable creates a scatter plot with a line of best fit.

 

For those interested in investigating the historical events themselves (as opposed to their data), I’ve included tabs that relate to these events through three different perspectives: that of a pilot, that of a commander, and that of a civilian. The pilot tab shows interesting historical footage—some documentary-style, and some genuine pilot training videos from the time—related to each conflict. The commander tab shows the progression and major events of each conflict on printed maps. The civilian tab shows, using a heatmap, the intensity of the bombings over the full courses of the wars.

 

Throughout this project, I came to appreciate the utility of multiple layers of abstraction for effective program design, and I used those layers to structurally mirror the way a human would think about organizing and implementing each aspect of such a project. In this way, I've saved myself a great amount of time (in all stages of the project: writing, refactoring, testing, and deploying), made the program more resilient to change, and saved the reader (including my future self) many headaches when poring through the code.

 

For one, the UI itself is laid out very simply in nearly WYSIWYG (what you see is what you get) form using abstracting placeholder functions that represent entire functional graphical components. The code itself is organized into several files beyond the prototypical ui.R, server.R, and global.R used in most Shiny apps. This made (for me) and makes (for the reader) finding the appropriate code sections much easier and more efficient (for instance, by localizing all parameters in a parameters.R file, and further grouping them by function, instead of having them spread out through multiple other files, or—even worse—hard-coding them), and also comes with the added benefit of allowing some of the more generalizable code sections to be easily repurposed in other projects in the future.

It’s All Greek to Me: Creating My Own Regex Writer

Link to the code on Github: utils_regex.R

 

Executive Summary:

I developed a library of trivial but useful regex-writing functions that make normally painful expressions faster to write and easier to read. I expanded the suite of typical regex functions to include others I wished had existed all along, mostly for reducing all the boilerplate code that comes along with certain types of expressions. I like using these functions because they make writing regex faster, reading easier, and debugging much simpler.

 

Motivations:

Regular expressions often look like chicken scratch to programmers who didn't write those specific expressions themselves. After working with them frequently, I find them relatively straightforward to write but still unfortunately painful to read and understand. I created this suite of functions to build up regular expressions in easy-to-understand blocks so that other programmers who look at my code (including future-me) can easily understand what these expressions are doing and how.

 

To start, why is there no simple regex remover function? Sure, you can write re.sub with repl equal to the empty string (gsub(replacement = "") for the R programmers), but why all the boilerplate? Also, why are the patterns always written first, when putting the strings they act on first (especially given R's pipe) would make more sense? Well…

 

rem(strings, pattern, …) performs a single substitution with the empty string (i.e., a removal); grem() is the gsub() version of that.
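
The definitions are presumably about this thin (a sketch; the real versions may differ in details):

    rem  <- function(strings, pattern, ...)  sub(pattern, "", strings, ...)
    grem <- function(strings, pattern, ...) gsub(pattern, "", strings, ...)
    grem(c("price: $1,200", "tax: $75"), "[$,]")   # [1] "price: 1200" "tax: 75"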

 

If I want to remove multiple things or do multiple substitutions from/on a list/vector of strings, do I really have to chain the expressions together (re.sub(re.sub(re.sub(re.sub(to infinity and beyond!)))) until the stack overflows? Or worse yet, copy-paste nearly the same line many times in a row with a new or identical variable name each time? Nope.

 

grems(), subs(), gsubs(), greps(), grepls(), regexprs(), and gregexprs() (the "s" just indicates the plural form) do exactly that, but with a built-in for loop to further reduce the boilerplate your eyes don't need when you're already looking at regex. subs() and gsubs() have the added benefit of using a single named vector in R, so "USA" = "United States" would turn "United States" into "USA". If you're starting with two separate vectors, just name the patterns with the replacements.
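
A sketch of how the named-vector form can work (the actual gsubs() may differ):

    gsubs <- function(strings, changes, ...) {
      for (i in seq_along(changes))   # one built-in loop instead of nested or repeated calls
        strings <- gsub(changes[[i]], names(changes)[[i]], strings, ...)
      strings
    }
    gsubs("I flew to the United States.", c("USA" = "United States"))
    # [1] "I flew to the USA."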

 

Do you have a set/list/vector of expressions you'd like to test all at once? Just wrap it inside any_of(), which will make the "(x|y|z)"-like construction for you. It's most useful if you have multiple nested or-bars.
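
any_of() is presumably little more than this (sketch):

    any_of <- function(...) paste0("(", paste(c(...), collapse = "|"), ")")
    any_of("cat", "dog", "bird")   # [1] "(cat|dog|bird)"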

 

Does finding a word need to be as ugly as "\\bword\\b"? I've lost count of the number of times I (or an error message) have caught myself having written "\\bob\\b" when I meant "\\bbob\\b" (the word bob). word("bob") writes that for you.
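
word() just writes the boundary anchors for you; a sketch:

    word <- function(w) paste0("\\b", w, "\\b")
    grepl(word("bob"), c("bob smith", "bobsled"))   # [1]  TRUE FALSE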

 

If you're removing certain words, you'll often end up with hanging punctuation that's painful to remove. Why not combine all of that into one step?
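
One way such a combined step can look, reusing the grem(), word(), and any_of() sketches above (the library's own function may differ):

    remove_words <- function(strings, words) {
      strings <- grem(strings, any_of(vapply(words, word, character(1))))  # drop the words
      strings <- gsub("\\s+([,.;:])", "\\1", strings)   # no orphaned space before punctuation
      gsub("\\s{2,}", " ", trimws(strings))             # collapse doubled spaces
    }
    remove_words("the quick, brown fox", c("quick", "brown"))
    # [1] "the, fox"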

 

Removing everything that occurs before or after (but not including) some highly repetitive set of characters can sometimes cause catastrophic backtracking and other related problems, so I've also created some functions that make that same process easier and faster (by providing a few better-behaved, proper lines in place of the one-line sub you're/I'm liable to write on a deadline) while keeping a clean, unobtrusive appearance.
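
For example, one backtracking-proof way to keep only what follows the last occurrence of a delimiter (a sketch of the idea, not the library's exact function):

    after_last <- function(strings, delim) {
      parts <- strsplit(strings, delim, fixed = TRUE)      # no regex, so no backtracking
      vapply(parts, function(p) p[[length(p)]], character(1))
    }
    after_last("unit/squadron/aircraft/B-17", "/")   # [1] "B-17"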