Link to the code on GitHub: utils_factors.R
Executive Summary:
I created a library of efficient and robust level-editing functions in R that allow for elegant formatting and handling of string data in the form of factors. I originally created them out of a need to clean textual data for my Shiny project, but I’ve found them so useful that I now regularly use them on textual data in other projects as well.
Motivations:
The impetus for creating this library of functions was how inefficient some factor-handling processes are in base R (and even in some useful packages designed for this work, such as Hadley Wickham’s forcats). (For the Pythonians among us: R’s factors are analogous to pandas’ categoricals, and the levels of a factor (a vector of type factor) are the different categories its values can take.) Needing to format tens of columns of millions of rows of strings within a time frame that allowed iterative improvement of the code I was writing, I simply couldn’t afford to wait half an hour every time I wanted to test whether my latest pipeline changes had been successful. So I sought out ways to process factors in R more efficiently. By this point I had already discovered the biggest bottleneck in my textual processing pipeline, which had been neglecting to use factors in the first place (and thereby processing each unique string as many times as it appeared in the dataset). But I looked deeper and found a programming quandary begging to be solved. And beyond its relationship to factor handling in R, what I discovered has reshaped my approach to processing textual data in general.
Initially I had simply tested the new pipeline on small aliquots of data (say, 1% of the total), which helped, but strings, unlike numerical data, are less predictable in how they respond to processing (a function may show issues for whole classes of numbers, such as negative numbers, numbers close to 0, or very large numbers, but there are simply too many different kinds of character strings to enumerate). Furthermore, processing only 1% at a time limited each attempted fix to roughly 1% of the errors, as many of the errors were unique. Profiling didn’t lead me anywhere interesting either: it merely showed that some base R operations were themselves what was slowing things down. Surely you can’t make things any faster without digging into C/C++, right?
Well, yes and no. I realized that data.table’s speed improvements over many base R or Hadley-verse equivalents (dplyr and the like) come not only from being written in C but also from setting values by reference. That is, instead of forming a brand-new data frame in memory each time it’s modified, data.table just modifies the original copy (already in memory). R’s typical copy-on-modify semantics are exactly what you’d want for exploratory data analysis, where corrupted or lost data from one accidentally imperfect exploratory query is the last thing you’d want. But the memory and processing overhead of making a copy every single time data are modified is a high price to pay in the context of a data pipeline (just make one copy first thing and forget about it until you want to process everything from scratch again). I ran some tests and found that R’s slow factor handling was caused not by the cardinality of the data (i.e. the actual editing of the factor levels, as I was doing) but by the sheer size of the data (i.e. the copying of all of it before each successive modification).
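To make that concrete, here is a minimal illustration of the difference (my own sketch, not code from the package), using data.table’s setattr() to rewrite only the levels attribute rather than reassigning through levels<-, which can copy the whole vector:

    library(data.table)

    f <- factor(sample(letters, 1e7, replace = TRUE))

    # Copy-on-modify: replacing the levels through `levels<-` can copy
    # all 10 million underlying integer codes just to change 26 strings.
    system.time(levels(f) <- toupper(levels(f)))

    # By reference: setattr() swaps out the "levels" attribute in place,
    # leaving the underlying integer codes untouched.
    system.time(setattr(f, "levels", tolower(levels(f))))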
So I set about designing a set of R functions that would facilitate directly modifying factor levels by reference, using data.table’s ability to directly edit column attributes. And aside from simply accessing the inner machinery of data tables, I saw it as an opportunity to build a fully developed and idiomatic library centered on efficient factor-level modification. A few of the design features I worked up to over time: uniformity in structure and use across the set of functions, placement of the factor vector as the first argument to allow use with the pipe operator (%>%), and invisible return of the edited factor vector to allow chained piping, for an intuitive relationship between code structure and function. Below I’ll review what each function does.
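To give a sense of the shape these functions take, here is an illustrative sketch (not the actual utils_factors.R source) of how something like format_levels can be structured around those design features:

    # Illustrative sketch only -- not the actual source of the package.
    format_levels_sketch <- function(fact, func, ...) {
      new_levels <- func(levels(fact), ...)  # operate on the few unique levels
      if (requireNamespace("data.table", quietly = TRUE)) {
        data.table::setattr(fact, "levels", new_levels)  # set by reference
      } else {
        levels(fact) <- new_levels  # base R fallback (copies)
      }
      invisible(fact)  # invisible return allows chained piping
    }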
format_levels(fact, func, …)
Replaces the factor levels with those same levels as modified by a function. Very useful for formatting text, such as capitalizing entries. Intuitive code like things_to_capitalize %>% format_levels(capitalize) is now paired with extremely fast performance.
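Note that capitalize isn’t a base R function; the example assumes a helper along these lines (or something like Hmisc::capitalize):

    library(magrittr)  # provides %>%

    # Assumed helper: uppercase the first character of each string.
    capitalize <- function(x) {
      substr(x, 1, 1) <- toupper(substr(x, 1, 1))
      x
    }

    things_to_capitalize <- factor(c("apple pie", "banana bread"))
    things_to_capitalize %>% format_levels(capitalize)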
format_similar_levels(fact, pairings, …)
Same as the above, but processes the levels with an entire set of functions, each paired with a regex pattern that determines which levels get altered by which function. Say you want to capitalize only certain entries based on their content: entry_data %>% format_similar_levels("^association" = capitalize) would capitalize all levels starting with "association".
replace_levels(fact, from, to)
Sometimes you’d just like to replace a single level (or several) with a specific new value, e.g. countries %>% replace_levels(from = "PRC", to = "China").
rename_levels(fact, changes)
Same as the above, but using a named vector instead, so the example becomes countries %>% rename_levels("China" = "PRC"). The new values are the names of the changes vector, so you can drop any unnamed levels to the empty string with an expression like countries %>% rename_levels("unknown").
rename_similar_levels(fact, changes, exact)
Same as the above, but using regex instead: countries %>% rename_similar_levels("Un." = "^United ") would abbreviate all countries starting with "United" to "Un.".
add_levels(fact, add)
Initializes a (currently unrepresented) level. For example, responses %>% add_levels("wants Python to start using brackets") would allow a new bar of height 0 to be shown for the number of people who want Python to start using braces/brackets like everyone else.
drop_levels(fact, drop, to)
Combines these (probably unimportant) levels into some other category (default: the empty string). For example, data %>% drop_levels(c("unk.", "unspecified"), to = "unknown") turns both the "unk." and "unspecified" levels into "unknown".
drop_similar_levels(fact, drop, to, exact)
Same as above, but matching levels by regex via a named vector, as before.
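By analogy with the examples above (my assumed usage, not taken from the package’s docs):

    # Fold every level matching either pattern into "unknown" (assumed usage).
    data %>% drop_similar_levels(c("^unk", "^unspec"), to = "unknown")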
drop_missing_levels(fact, to)
Combines all unrepresented levels into one (default empty string) level.
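This is especially handy after subsetting rows, which often leaves levels with zero occurrences behind. An assumed usage, mirroring the patterns above:

    # After filtering, collapse all now-empty levels into one (assumed usage).
    recent_data %>% drop_missing_levels(to = "not observed")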
keep_levels(fact, keep, to)
Drops/combines all levels except for those specified in keep.
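An assumed usage, mirroring the drop_levels example:

    # Keep only the levels of interest; everything else becomes "other" (assumed usage).
    countries %>% keep_levels(c("China", "India", "Brazil"), to = "other")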
keep_similar_levels(fact, keep, to, exact)
Same as above, but matching levels by regex via a named vector, as before.
reduce_levels(fact, rules, other, exact)
Decides which levels to drop/combine/otherize into the string specified in other (default "other"), based on regex rules.
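An assumed usage, following the named-vector convention of rename_similar_levels above:

    # Map families of raw answers onto clean categories by regex; anything
    # unmatched goes to the `other` string (assumed usage).
    responses %>% reduce_levels(c("yes" = "^y", "no" = "^n"), other = "unclear")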
otherize_levels_rank(fact, cutoff, other, otherize_empty_levels, include_ties)
Combines into the other category all levels whose frequency rank falls below a given cutoff. Exact behavior can be tuned with the last two arguments, which are booleans.
otherize_levels_prop(fact, cutoff, other, otherize_empty_levels)
Same as above, except the cutoff is a proportion (e.g. combine all levels that individually account for less than 1% of values).
otherize_levels_count(fact, cutoff, other, otherize_empty_levels)
Same as above, except based on a hard number cutoff.
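A hedged usage sketch for the otherize family (the argument values here are mine, chosen for illustration):

    # Assumed usages of the otherize family:
    cities %>% otherize_levels_rank(cutoff = 10)    # keep only the 10 most frequent levels
    cities %>% otherize_levels_prop(cutoff = 0.01)  # combine levels under 1% of values
    cities %>% otherize_levels_count(cutoff = 1000) # combine levels appearing < 1,000 times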
And then there is a parallel set of functions with the same names but ending in _by_col, which, instead of taking a single column vector from a data.table, take an entire data.table and apply the corresponding function to each column (or to a named subset of them).
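For example (assumed usage; the name of the column-selection argument is my guess):

    # Apply a format to the factor levels of several columns at once
    # (assumed usage; `cols` as the argument name is a guess).
    dt %>% format_levels_by_col(capitalize, cols = c("city", "country"))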
This set of factor-editing functions does not technically have data.table as a dependency (if you don’t have data.table installed, it finds a way to do everything in base R, since the functions provide a convenient interface for working with factors regardless of the speed improvements), but it’s much faster with data.table’s set-by-reference.
There’s one more, more complicated function, which does require data.table: it carefully fills in missing data across two columns that are redundant or otherwise related (e.g. a country code and a country name). I initially had it in mind just for the redundant code:value structure of my Shiny project’s data, but I found I could also expand it to validate and fill in city:country (and other similar) relationships, as any one-to-one or many-to-one relationship will work. I could write an entire article on this function alone, though, so I will cut things off here.
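Still, the core idea fits in a few lines (a sketch of my own, not the package’s actual implementation): build a lookup table from the rows where both columns are present, then fill each column’s gaps from the other by reference.

    # Sketch of the core idea only -- not the package's actual implementation.
    library(data.table)
    dt <- data.table(code = c("CN", "CN", NA, "FR"),
                     name = c("China", NA, "China", "France"))

    # Lookup of complete code:name pairs (assumes a one-to-one relationship).
    lookup <- unique(dt[!is.na(code) & !is.na(name)])

    # Fill each column from the other, by reference, where it is missing.
    dt[is.na(name), name := lookup$name[match(code, lookup$code)]]
    dt[is.na(code), code := lookup$code[match(name, lookup$name)]]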
If you do any work with factors in R on large datasets, I’d encourage you to see what kind of performance (and simplicity) improvements you can achieve with data.table paired with this package.