Category Archives: NLP

Re-learning English: Pluralization and Lemmatization

Links to code on Github: the singularizer and pluralizer, and the lemmatizer


Executive Summary:

I created a word lemmatizer in R and Python that automatically “digests” words down to their core form for the purpose of dimensionality reduction for NLP (Natural Language Processing) tasks. The lemmatizer works by recognizing formulaic endings to English words, yet handle enough special cases that they outperform stemmers such as Porter and Snowball in accuracy. Because this lemmatizer relies on structured word endings, it is less dependent on external dictionaries than most lemmatizers such as WordNet, and as such, it performs nearly equally as well on real or invented words that do not appear in dictionaries (unlike WordNet).



My motivations starting this project were simple: dig my hands deeper into NLP (Natural Language Processing) and develop useful code for myself where other code wouldn’t (or wouldn’t quickly) work. What started as a quick exploratory fix on my Wikipedia Spark project then turned into a substantial sub-project that carefully handles English words; I will focus on the final product here.


There are two main camps for dimensionality reduction in the NLP word: stemming and lemmatization. Stemmers (usually) simply chop off endings in an attempt to eliminate relatively meaningless differences in phonology (e.g. does the analysis really depend on whether he walks or I walk? The key idea is walk in both cases). While useful for their speed and simplicity, they suffer from an overly simplistic approach to manipulating words that results in many disconnected similar words (e.g., multiply and multiplication stem to similar but distinct terms, a type II error of sorts) as well as many spuriously connected words (e.g. universal, university, and the made-up word universion all stem to univers, a type I error of sorts). Lemmatizers, on the other hand, seek to capture the “lemma” or core/root of a word by looking through relationships in dictionaries. With lemmatizers, phonologically difficult relationships like that between better and good are accurately captured, so they’re very useful when accuracy is needed (especially for reduction of type I errors/overstemming). However, with a lemmatizer, one often has to put in extra under-the-hood work to link together related words of different parts of speech, first detecting the part of speech (with error) and then traversing the web of derivationally related forms (and synsets and pertainyms, etc.) until one decides on a “root” form. Furthermore, words not captured by the dictionary can’t be handled properly, which is a more common problem than one would expect. For instance, what’s the relationship between carmaker and carmaking? WordNet, one of the most powerful and popular lemmatizers (used by the popular NLTK package), doesn’t know, as carmaking isn’t defined. Undefined words can be accurately processed by stemmers (the carmaking example is, for example) but pose a real problem for many lemmatizers. Clearly there is a middle ground between the overly general world of stemming and overly specific world of lemmatizing, so I set out to create a program that would find it.


English as a language is characterized by its decentralization and archaic spelling as well as its repeated (though occasionally inconsistent) use of patterns to generate inflectional forms. This fact requires a mix of dynamic/rule-based approaches (to generalize the inflectional system) as well as static/case-based approaches (to successfully handle legacy spellings and manage the confusing mixture of Latin, Greek, French, and Anglo-Saxon morphology that characterize English words).


So, I created the “word digester” to combine the best aspects of stemmers (simplicity/ease of use; usefulness on words left out of the dictionary) with the best aspects of lemmatizers (comprehensiveness; handling of special cases; readability of outputs). It will take any English-like word (newly invented or well-established) and digest it into its core form. It handles enough special cases to be useful on some of the more peculiar English word forms, yet it uses enough general rules to be useful without relying on a dictionary. It’s not prohibitively slow either (digesting a 20k-word dictionary in roughly 1-3 seconds in both R and Python). Admittedly, as a recent personal project of definite complexity, there are still plenty of gaps in its coverage of special cases, which can be fixed in time. Below I’ll highlight a few examples that highlight the capabilities of the word digester and also discuss the singularizer and pluralizer functions I created as a companion project to the digester.


The (real or invented) words multiple, multiples, multiply, multiplied, multiplies, multiplier, multipliers, multiplying, multiplication, multiplications, multiplicative, multiplicatives, multiplicativelymultiplicativitymultiplicativeness, multiplicationment, multiplicativement, antimultiplicative, unmultiplicative, nonmultiplicative, dismultiplicative, multiplicate, multiplicating, and potentially more all reduce to multiple.


It handles some of the most relevant Latinate or Greek plurals as well: symposia and symposiums both reduce to symposium, and lemmata and lemmas both reduce to lemma, for instance, and the pluralize function allows selective creation of either plural form in the reverse direction. In addition, hypotheses reduces to hypothesis, matrices to matrix, series to series, spectra to spectrum, phenomena to phenomenon, alumni to alumnus, and so on.


Anglo-Saxon-style forms are also covered: berries reduces to berry, cookies to cookie, leaves to leaf, children to child, oxen to ox, elder to old, (wo)men to (wo)man, people to person, brethren to brother, and so on.


Many irregular changes are covered: pronunciation reduces to pronounce, exemplify to example, redemption to redeem, tabular to table, operable (and the made-up but more regular-looking operatable) to operate, despicable to despise, grieve to grief, conception to conceive, flammable to flame, statuesque to statue, wholly to whole, shrubbery to shrub, and so on.


Sensible words outside the dictionary are also properly handled: both buttery as well as butterific reduce to butter, and planning as well as the E.U.-favored planification both reduce to plan. Feel free to make up your own words and try them; currently only some rules are sufficiently generalized, so while antidisestablishmentarianismesquely reduces to establish, I can’t promise that every made-up word will be processed correctly.


I’ve verified, semi-manually, the digester’s results on a few corpora: a 1k-word corpus, a 20k-word corpus, and a 370k-word corpus. No errors were made with the 1k corpus; nearly all of the 1200 or so purported errors (defined here as digested word results that do not also appear in the same source corpus) made on the 20k corpus are either—from most common to least common—errors on proper nouns (expressly not intended to be handled by the function as other methods exist for that purpose), real words that do not appear in that corpus (not an error), or genuine errors that I haven’t bothered to fix yet (estimated on the order of 10 or so). This implies a true error rate of around 1% or less on the 20k-word corpus. I have skimmed through the purported errors on the 370k-word corpus (~47k of them), and have similarly found that most of the purported errors either are on proper nouns, are not real errors at all, or are on exceptionally rare, esoteric, often medicinal or taxonomical words (as these make up most of the 370k-word corpus to begin with as a consequence of the Zipf distribution of word frequency).


How does it work? Take a peek at the code on Github to see for yourself—though it is long, it is well-commented and fairly self-explanatory. The basic process involves handling each of the many recognized suffixes in the order in which they almost universally appear in words. Special cases are handled before the best-guess set of general rules handle the suffixes. Masks are used to hide words that have suffix-like endings that should not be removed. One of the most difficult aspects of making the digester as accurate as possible was designing an effective regex test to determine whether an ultimate “e” should remain attached to a word after suffix removal. The suffixes are handled in groups, in which only relevant subsets of words (those that end with something that looks like the suffix in question) are handled for efficiency. Writing some aspects of this code in C would certainly greatly speed up computation, but there isn’t any need for that on most scales as is, as I’ve designed the code such that it does not scale directly with the size or length of the input, but rather with the cardinality of the input (i.e. it handles each unique word only once).


The order I’ve used for processing is:

  • singularize plural forms
  • remove contractions
  • fix irregular common words (all as special cases)
  • remove certain prefixes (with a mask to prevent similar non-prefix forms from being removed)
  • remove adverbial ly suffix
  • remove noun-form ness and ity suffixes
  • remove adjectival esque and ish suffixes
  • remove adjectival able and ible suffixes
  • remove noun-form hood suffix
  • edit miscellaneous Saxon- and French-style suffixes
  • remove istic, ist, and ism suffixes
  • remove adjectival al, ial, ical, ual, and inal suffixes
  • remove adjectival ian suffix
  • remove adjectival ary suffix
  • remove noun-form ment suffix
  • remove adjectival ic suffix
  • remove adjectival ous suffix
  • remove adjectival ful and less suffixes
  • remove adjectival ar and iar suffixes
  • edit noun-form ance and ence suffixes
  • remove adjectival ant and ent suffixes
  • remove adjectival ive suffix
  • remove noun-form age suffix
  • remove noun-form tion and ion suffixes
  • remove noun-form ery suffix
  • remove adjectival y suffix
  • remove noun-form itude suffix
  • edit noun-form ysis suffix
  • remove comparative and doer-form er suffix
  • remove superlative est suffix
  • remove verb-form ed suffix
  • remove verb-form ing suffix
  • remove verb-form ify suffix
  • remove verb-form ize suffix
  • remove verb-form ate suffix
  • fix miscellaneous verb forms to similar noun forms (e.g, multiply to multiple)


Re-Learning English multiple.png

What’s important to note is that each step along the way, the suffixes are removed such that a valid word is produced. In this way, the transformation pipeline functions such that it “collects” different derivationally related forms on their way toward the base form, analogously to how a river’s tributary system works. As such, any failure of the system to properly join one subset of derivationally related forms into further pathways toward the base form will leave that subset as a single valid (albeit not completely basal) word instead of creating something new and unrecognizable. Additionally, this design makes debugging and adding extra special cases easy, as each special case only need be handled once in order for all its derivationally related forms to also be successfully processed. Similarly, it’s straight-forward to force the system to not join together some branches if desired––the change need only be made in one place.



I initially coded up these functions in R, though wanting to expose it to wider use, especially within the NLP universe (which is very Python-centric), I transcribed the code into Python with the help of an R-to-Python transcription program I wrote myself. To learn more about that process, see this post.

“Fake News”: Detecting Bias on Wikipedia using NLP

Executive Summary:

My team partner (Rachel Kogan, a NYC Data Science Academy bootcamp classmate) and I filtered the Wikipedia English corpus down to bags of words using pySpark and trained machine learning models to detect bias (point of view tags) on TF-IDF data with nearly 90% accuracy.



The tide of news about fake news has reached a crescendo, and not many feasible solutions have been proposed for controlling the spread of biased information. The push for Facebook and other large social media sites on which most fake news seems to promulgate to manually editorialize their content is certainly infeasible for one. This problem seems ripe for machine learning approaches, specifically Natural Language Processing, so we set out to develop models that could detect bias in a large textual corpus. Due to its ease of access and pre-labeled content, we trained our models on the English Wikipedia corpus.


Wikipedia is one of the topmost visited websites in the English-speaking world and serves as a first go-to when looking for a quick overview on almost any topic. It’s surprisingly accurate and reliable given its anyone-can-edit ideology and site structure (roughly as reliable as the Encyclopedia Britannica according to a 2005 Nature study). Bots and users alike peruse the site for vandalism and other problematic edits, marking certain sentences, paragraphs, and entire articles as having a problem from a specific set of pre-defined tags. Fortunately for us, POV, or point of view (implying non-neutral point of view) is one such tag. There are several thousand articles in the English language Wikipedia with current POV tags, which provided a substantial enough body of data with which to train a bias detector. We downloaded the entire current-version (as of May 2017) English Wikipedia corpus in XML and stored it in the cloud using Amazon’s EC2 and S3 services. After separating the massive (~64GB) textual dataset into separate pages by the embedded XML page tags, we were able to filter the pages down to the usable articles (removing redirect pages, talk and other user pages, and simple list pages). The POV tags within each page’s text follow a replicable format (“{{POV … }}”), allowing for a regex-like filtering of the total articles down to just the articles with POV issues, and furthermore, articles for which the entire body of text is tagged as having POV issues (as opposed to just a paragraph or single line) also show a replicable special POV tag located in the header of the text, making filtering out articles with only very minor POV issues simple.


We then sampled from the rest of the corpus (presumably without POV issues) to generate training and test data sets with roughly equal proportions of non-neutral and relatively neutral articles. The XML markdown-style documents were then parsed with a set of regular expressions to generate a simple bag of words for each article (thereby removing all tags, including those that mark the article as having POV issues). We tokenized the bags of words and removed stop-words, followed by generating TF-IDF values for each unique term in each document (Term Frequency Inverse Document Frequency is a canonical way to calibrate simple bag of words term counts into relative term importances by adjusting by the rarity of each term across the entire corpus).


We trained many kinds of models to analyze which may be best at detecting bias in Wikipedia articles (and, by extension, in other sources of textual data as well). Among the best performers were logistic regression, random forest, and XGBoost (see metrics below).

Wikipedia model comparison.png


Additionally, we found that our models were fairly confident in their predictions, giving further credence to the idea that something as complicated as bias can be well handled by NLP and machine learning methods (x-axis is predicted probability, and y-axis is the number of such observations):

Wikipedia validation set predicted probabilities by model.png


We further investigated correlations in misclassification between these models to see if an ensembled approach would significantly improve our results. However, high correlations between our models limited the extent to which we could improve our overall misclassification rate through ensembling. As you can see below, any majority-voting metric would have misclassified at nearly the same rate.

Wikipedia misclassification.png