Category Archives: algorithms

Re-learning English: Pluralization and Lemmatization

Links to code on Github: the singularizer and pluralizer, and the lemmatizer


Executive Summary:

I created a word lemmatizer in R and Python that automatically “digests” words down to their core form for the purpose of dimensionality reduction for NLP (Natural Language Processing) tasks. The lemmatizer works by recognizing formulaic endings to English words, yet handle enough special cases that they outperform stemmers such as Porter and Snowball in accuracy. Because this lemmatizer relies on structured word endings, it is less dependent on external dictionaries than most lemmatizers such as WordNet, and as such, it performs nearly equally as well on real or invented words that do not appear in dictionaries (unlike WordNet).



My motivations starting this project were simple: dig my hands deeper into NLP (Natural Language Processing) and develop useful code for myself where other code wouldn’t (or wouldn’t quickly) work. What started as a quick exploratory fix on my Wikipedia Spark project then turned into a substantial sub-project that carefully handles English words; I will focus on the final product here.


There are two main camps for dimensionality reduction in the NLP word: stemming and lemmatization. Stemmers (usually) simply chop off endings in an attempt to eliminate relatively meaningless differences in phonology (e.g. does the analysis really depend on whether he walks or I walk? The key idea is walk in both cases). While useful for their speed and simplicity, they suffer from an overly simplistic approach to manipulating words that results in many disconnected similar words (e.g., multiply and multiplication stem to similar but distinct terms, a type II error of sorts) as well as many spuriously connected words (e.g. universal, university, and the made-up word universion all stem to univers, a type I error of sorts). Lemmatizers, on the other hand, seek to capture the “lemma” or core/root of a word by looking through relationships in dictionaries. With lemmatizers, phonologically difficult relationships like that between better and good are accurately captured, so they’re very useful when accuracy is needed (especially for reduction of type I errors/overstemming). However, with a lemmatizer, one often has to put in extra under-the-hood work to link together related words of different parts of speech, first detecting the part of speech (with error) and then traversing the web of derivationally related forms (and synsets and pertainyms, etc.) until one decides on a “root” form. Furthermore, words not captured by the dictionary can’t be handled properly, which is a more common problem than one would expect. For instance, what’s the relationship between carmaker and carmaking? WordNet, one of the most powerful and popular lemmatizers (used by the popular NLTK package), doesn’t know, as carmaking isn’t defined. Undefined words can be accurately processed by stemmers (the carmaking example is, for example) but pose a real problem for many lemmatizers. Clearly there is a middle ground between the overly general world of stemming and overly specific world of lemmatizing, so I set out to create a program that would find it.


English as a language is characterized by its decentralization and archaic spelling as well as its repeated (though occasionally inconsistent) use of patterns to generate inflectional forms. This fact requires a mix of dynamic/rule-based approaches (to generalize the inflectional system) as well as static/case-based approaches (to successfully handle legacy spellings and manage the confusing mixture of Latin, Greek, French, and Anglo-Saxon morphology that characterize English words).


So, I created the “word digester” to combine the best aspects of stemmers (simplicity/ease of use; usefulness on words left out of the dictionary) with the best aspects of lemmatizers (comprehensiveness; handling of special cases; readability of outputs). It will take any English-like word (newly invented or well-established) and digest it into its core form. It handles enough special cases to be useful on some of the more peculiar English word forms, yet it uses enough general rules to be useful without relying on a dictionary. It’s not prohibitively slow either (digesting a 20k-word dictionary in roughly 1-3 seconds in both R and Python). Admittedly, as a recent personal project of definite complexity, there are still plenty of gaps in its coverage of special cases, which can be fixed in time. Below I’ll highlight a few examples that highlight the capabilities of the word digester and also discuss the singularizer and pluralizer functions I created as a companion project to the digester.


The (real or invented) words multiple, multiples, multiply, multiplied, multiplies, multiplier, multipliers, multiplying, multiplication, multiplications, multiplicative, multiplicatives, multiplicativelymultiplicativitymultiplicativeness, multiplicationment, multiplicativement, antimultiplicative, unmultiplicative, nonmultiplicative, dismultiplicative, multiplicate, multiplicating, and potentially more all reduce to multiple.


It handles some of the most relevant Latinate or Greek plurals as well: symposia and symposiums both reduce to symposium, and lemmata and lemmas both reduce to lemma, for instance, and the pluralize function allows selective creation of either plural form in the reverse direction. In addition, hypotheses reduces to hypothesis, matrices to matrix, series to series, spectra to spectrum, phenomena to phenomenon, alumni to alumnus, and so on.


Anglo-Saxon-style forms are also covered: berries reduces to berry, cookies to cookie, leaves to leaf, children to child, oxen to ox, elder to old, (wo)men to (wo)man, people to person, brethren to brother, and so on.


Many irregular changes are covered: pronunciation reduces to pronounce, exemplify to example, redemption to redeem, tabular to table, operable (and the made-up but more regular-looking operatable) to operate, despicable to despise, grieve to grief, conception to conceive, flammable to flame, statuesque to statue, wholly to whole, shrubbery to shrub, and so on.


Sensible words outside the dictionary are also properly handled: both buttery as well as butterific reduce to butter, and planning as well as the E.U.-favored planification both reduce to plan. Feel free to make up your own words and try them; currently only some rules are sufficiently generalized, so while antidisestablishmentarianismesquely reduces to establish, I can’t promise that every made-up word will be processed correctly.


I’ve verified, semi-manually, the digester’s results on a few corpora: a 1k-word corpus, a 20k-word corpus, and a 370k-word corpus. No errors were made with the 1k corpus; nearly all of the 1200 or so purported errors (defined here as digested word results that do not also appear in the same source corpus) made on the 20k corpus are either—from most common to least common—errors on proper nouns (expressly not intended to be handled by the function as other methods exist for that purpose), real words that do not appear in that corpus (not an error), or genuine errors that I haven’t bothered to fix yet (estimated on the order of 10 or so). This implies a true error rate of around 1% or less on the 20k-word corpus. I have skimmed through the purported errors on the 370k-word corpus (~47k of them), and have similarly found that most of the purported errors either are on proper nouns, are not real errors at all, or are on exceptionally rare, esoteric, often medicinal or taxonomical words (as these make up most of the 370k-word corpus to begin with as a consequence of the Zipf distribution of word frequency).


How does it work? Take a peek at the code on Github to see for yourself—though it is long, it is well-commented and fairly self-explanatory. The basic process involves handling each of the many recognized suffixes in the order in which they almost universally appear in words. Special cases are handled before the best-guess set of general rules handle the suffixes. Masks are used to hide words that have suffix-like endings that should not be removed. One of the most difficult aspects of making the digester as accurate as possible was designing an effective regex test to determine whether an ultimate “e” should remain attached to a word after suffix removal. The suffixes are handled in groups, in which only relevant subsets of words (those that end with something that looks like the suffix in question) are handled for efficiency. Writing some aspects of this code in C would certainly greatly speed up computation, but there isn’t any need for that on most scales as is, as I’ve designed the code such that it does not scale directly with the size or length of the input, but rather with the cardinality of the input (i.e. it handles each unique word only once).


The order I’ve used for processing is:

  • singularize plural forms
  • remove contractions
  • fix irregular common words (all as special cases)
  • remove certain prefixes (with a mask to prevent similar non-prefix forms from being removed)
  • remove adverbial ly suffix
  • remove noun-form ness and ity suffixes
  • remove adjectival esque and ish suffixes
  • remove adjectival able and ible suffixes
  • remove noun-form hood suffix
  • edit miscellaneous Saxon- and French-style suffixes
  • remove istic, ist, and ism suffixes
  • remove adjectival al, ial, ical, ual, and inal suffixes
  • remove adjectival ian suffix
  • remove adjectival ary suffix
  • remove noun-form ment suffix
  • remove adjectival ic suffix
  • remove adjectival ous suffix
  • remove adjectival ful and less suffixes
  • remove adjectival ar and iar suffixes
  • edit noun-form ance and ence suffixes
  • remove adjectival ant and ent suffixes
  • remove adjectival ive suffix
  • remove noun-form age suffix
  • remove noun-form tion and ion suffixes
  • remove noun-form ery suffix
  • remove adjectival y suffix
  • remove noun-form itude suffix
  • edit noun-form ysis suffix
  • remove comparative and doer-form er suffix
  • remove superlative est suffix
  • remove verb-form ed suffix
  • remove verb-form ing suffix
  • remove verb-form ify suffix
  • remove verb-form ize suffix
  • remove verb-form ate suffix
  • fix miscellaneous verb forms to similar noun forms (e.g, multiply to multiple)


Re-Learning English multiple.png

What’s important to note is that each step along the way, the suffixes are removed such that a valid word is produced. In this way, the transformation pipeline functions such that it “collects” different derivationally related forms on their way toward the base form, analogously to how a river’s tributary system works. As such, any failure of the system to properly join one subset of derivationally related forms into further pathways toward the base form will leave that subset as a single valid (albeit not completely basal) word instead of creating something new and unrecognizable. Additionally, this design makes debugging and adding extra special cases easy, as each special case only need be handled once in order for all its derivationally related forms to also be successfully processed. Similarly, it’s straight-forward to force the system to not join together some branches if desired––the change need only be made in one place.



I initially coded up these functions in R, though wanting to expose it to wider use, especially within the NLP universe (which is very Python-centric), I transcribed the code into Python with the help of an R-to-Python transcription program I wrote myself. To learn more about that process, see this post.

An Basic Approach to Reinforcement Learning

Executive Summary:

Despite my lack of expertise with the topic, I used this approach to win 3rd place at a Trivago hackathon in NYC on November 11, 2017. The task was to improve Trivago in some way (an incredibly open objective), and most teams, including the one I pitched at led, focused on improving Trivago’s web portal and search listings to help grow Trivago’s market share. My idea was roughly to apply a basic reinforcement learning approach to change the static (non-learning) search result ranking algorithm into one that continuously improves its market segmentation based on user interaction with the site. My proposed algorithm ranks search results according to what has worked the best with demographically similar visitors in the past, along with enough randomness to continue learning (i.e. not get stuck in a local minimum). The Trivago judges mentioned that the general idea is one they’re already developing and that they liked my specific approach, which was a bit different than what they had considered thus far.



It is sensible to assume that everyone prefers better-rated hotels at better prices, but I’d expect some customer segments to be more price-sensitive than others. (Some people will always book the cheapest option; some will always book their favorite regardless of price; and some will carefully deliberate.) I think it’s also sensible to assume that this behavior will, to some extent, track with demographic information such as income. And further, since IP is a proxy for location, and location is a proxy for demographic information like income, Trivago and other similar companies should be able to use such information when ranking listings for users.


I created a multi-dimensional space (numpy’s ndarray being the natural choice) that keeps track of how price-sensitive and discount-sensitive demographic segments are: one dimension for income segments (log transformed), one dimension for age segments, and one dimension as a proof-of-concept placeholder for other segments like gender. A final dimension just divides the space up into price-sensitivity information and discount-sensitivity information, though I conceptualize it as a three- (or n-) dimensional space pointing to a two- (or m-) element vector. Such a system does not rely on particular values of m and n, which can be chosen based on what provides the most useful information (n being the number of demographic information types that are useful, and m being the number of sensitivity-to-hotel-aspect features that are useful). I’ll call this the “demographic preference space”.


I initialize a few parameters, such as the learning rate, the level of randomness, and a set of effect-size parameters that influence how much different types of actions (from greatest to least: creating an account, purchasing through express checkout, clicking on a hotel, or leaving without purchasing anything) affect the demographic preference space. While making a hotel purchase does directly indicate a user liking a particular hotel, there’s unfortunately no direct measure of a user’s dislike of a particular hotel, so I set the effect-size for leaving without purchasing anything as negative to “blame” all the hotels presented to the user a little bit for not being what the user wanted.


Each interaction changes the demographic preference space based on the learning rate and the effect size. When deciding which hotel listings to present to a new user, each hotel that corresponds to their searched terms gets scored based on how that hotel scores along each hotel aspect (price and discount) compared with the assumed preferences of the user based on the demographic preference space. In more detail, a local Gaussian average of price-sensitivity and discount-sensitivity are taken in the demographic preference space, centered at the user’s own demographic position. These sensitivities are passed through a hyperbolic tangent (tanh) function to provide reasonable bounds on their amplitudes. Each sensitivity (for instance, price-sensitivity) is multiplied against its relevant score from the hotel (for instance, price-score), and each of these terms is summed for a total score for the hotel for the user. As I also scaled hotel aspect scores (price scores and discount scores) from +1 to -1, low discount-sensitivity (-1) against low discount (-1) registers as a match, just as high discount-sensitivity (1) against high discount (1). A (relatively small, adjustable) random number is added to the score for each hotel to prevent the algorithm from getting stuck in a local minimum, and the hotels are sorted by score and displayed to the user accordingly.


The main proof of concept was showing that despite initial rankings being random, wealthier and less wealthy phantom users applying their preferences (toward nicer hotels and toward cheaper hotels, respectively) quickly (within a few hundred interactions) resulted in wealthier users being shown accordingly nice hotels and less wealthy users being shown accordingly cheaper hotels.


One major challenge of this competition was the lack of any user or hotel data, which I had to write functions to generate for this proof of concept. That, combined with the competition’s duration of only one day, prevented any more sophisticated approaches, though I had thought of a few. The algorithm as I submitted it does not make any outright assumptions about hotel preferability, instead just following what the users pick. Perhaps a hybrid approach could improve results. There were also plenty of other hotel aspects, most notably rating (and users’ sensitivities to these hotel aspects, e.g. rating-sensitivity), that could have easily been included in the demographic preference space. And despite using some nuts and bolts of neural networks (such as the idea of a learning rate, vaguely convolution-like processing of the demographic preference space, incrementation of weights based on fit/error assessment, and use of functions like tanh to prevent “blowing-up activation”), there was no neural network component to my solution despite being well-suited for such an approach. Had we more time, I would have pursued such low-hanging fruit as next steps.

A Brief Explanation of My Master’s Thesis

While impossible to cover the full breadth of the 204-page behemoth itself, for those curious I’ve included a computer science-only highlight summary of my Master’s thesis. I’m shying away from most bioscience terminology and avoiding most of the biological background, as the goal of this article is to overview my project for those outside the field.



Photosynthesis is incredibly important for life on Earth as we know it, as it creates and sustains the oxygen in the air that we breathe. Much of that oxygen is produced not only by plants (though they do play a large role) but by cyanobacteria (photosynthetic bacteria that also formed the foundation for how photosynthesis works in plants and algae) covering the world’s oceans. There are many basic biological questions about these essential creatures that we don’t understand. One involves how these life-giving membranes divide.



Cellular division is fairly well understood, but that’s only part of the puzzle of how such cells divide. If you imagine dividing a lasagna in two for two people to eat, you simply can’t separate the two halves without separating each layer from top to bottom; you have to make a full cross-sectional cut to separate the two pieces. Carrying this analogy forward, cellular division (division of the cellular membranes and the cells themselves) only handles how the top-most and bottom-most layers of the lasagna get cut—the inner layers (here, the thylakoid membranes) must also be split through some unknown mechanism.



As a one-sentence summary of the thesis, I propose a short but comprehensive set of possibilities for how this (thylakoid membrane division) might occur, observe the process carefully in two divergent species of cyanobacteria using light microscopy, implement a new image segmentation technique tailored for this setup, and find that the two species appear to use different processes for dividing their thylakoid membranes, one orderly and one disorderly.


These findings suggest different approaches for discovering the mechanistic details of how this division occurs, and I lay the groundwork and a roadmap for future work in this area, including specifying possible genetic targets (though I’ll skip that here).



Master's Thesis Wikipedia Synechocystis.png

General (simplified) thylakoid membrane morphology, using genus Synechocystis as example. Source: Wikimedia Commons.

Master's Thesis Models.png

Division models proposed in my thesis (the thylakoid membranes are shown in green with red edges). On the left we have division triggered specifically in advance of the dividing cell membranes; in the middle we have sufficiently frequent division such that a triggering mechanism is not needed; on the right we have no division mechanism short of the thylakoids being forcefully “cut” by the dividing cell membranes.



Burden Placed on Image Analysis:

The relevant details here are those relating to the segmentation algorithm I developed and implemented. A few background observations are also important to note. First, these bacteria are small and grow in clusters with very little distance to separate each individual cell. The combination of these two factors in particular makes proper segmentation of each cell difficult. Edges, that when observed at a large scale are sharp, are instead necessarily blurry due to the diffraction limit of light (the cells are not much larger than the waves we’re using to image them, so resolution is inherently limited by the physics of light). The high magnification required to visualize the cells (and their internal membranes) also means that we won’t be able to afford a high density of photons per pixel in the image, which means that the images will be inherently noisy (grainy) as well. Attempts to mitigate either of these problems (the blur and the noise) by choosing a smaller wavelength or increasing the intensity of light will poison the cyanobacteria with light outside their acceptable range of energies (the photosystems are tuned to specific frequencies within the orange-red part of the visual spectrum, and they can only sustain so much irradiative intensity before being damaged and destroyed). Longer exposures miss out on potentially important second-by-second information about the division process (yes, I did find significant movements that occurred faster than the framerate could capture) and are also not ideal. As such, we unfortunately have to manage these problems in silico (computationally) instead of in vivo (physically). Offloading these physical problems onto the computational side of the project places a high burden on the image analysis system. Typical tools used to segment images of cells (typically much larger, eukaryotic cells) performed poorly, often segmenting clusters of cells and/or mere portions of cells.


Imaging Solutions:

I implemented a dual-channel approach in which two related but independent measures of cell shape were taken: one using phase contrast and one using the inherent fluorescence of the thylakoid membranes distributed throughout the cells. Phase contrast microscopy uses differences in refractive index (specifically the retardation of light waves passing through media of higher refractive index) and the ability of light to self-interfere to bring attention to the borders between objects of different refractive index (in a physical as opposed to non-computational way). It is the standard approach for cell segmentation, but as indicated earlier, due to the difficulties of this particular imaging setup, it performed poorly on its own. Consecutive phase contrast images were also too correlated and mistakes made on one by the general segmentation software were often repeated on the other. Cyanobacteria, due specifically to their light-feeding properties, are also fluorescent under certain wavelengths of light. I imaged each field alternatingly in quick succession using phase contrast and epifluorescence microscopy (epifluorescence specifically measures light emitted by a fluorescent object, as opposed to possible reflections or diffractions around it). These two orthogonal (in the informational though not geometrical sense) measures of cell shape provided much more information for accurately segmenting the cells, so I went about implementing my own image segmentation algorithm capable of simultaneously using information from both channels.


The benefits of using these two imaging techniques go deeper than just providing alternate estimates of the cells’ shapes. Each has its own associated gradient, and these gradients properly anti-align at the cell boundary (thylakoid fluorescence is brightest inside the cell, where the thylakoids actually are, and phase contrast is darkest inside the cell, due to the cell contents’ slowing of light as it passes through leading to interference). The information gathered from each source therefore had the greatest detail (toward the right side of the histogram in photographer’s terms) where the other had the least detail (toward the left side of the histogram); the sources were exceptionally complementary.


Algorithmic Approach:

The use of these complementary imaging methods made it feasible to segment cells, but what about the greater task of segmenting some ten thousand frames of a hundred or more cells each within a reasonable time frame? One typical approach is to use constricting polygons around each cell center with an “energy” function that penalizes large distance between the vertices of the polygon and further penalizes sharp angles between polygon edges in a manner tunable with hyperparameters. Such methods calculate the total energy around the perimeter of the cell many times as a local energy minimum is approached. Such an approach, aside from being computationally expensive when considering a million cell-frames, is weak both to local minima in noisy images and to thin cells pressed side-by-side in clusters, both of which are characteristics of the imaging context we’re considering here. I knew a more appropriate method would be required, and set out to design one from scratch starting from first principles and immediately considering this unique imaging context.


The longer-term goals of the imaging project were to carefully and accurately analyze aspects of cell shape and internal thylakoid membrane shape throughout the division cycle, which would require exceptionally accurate estimations of cell size and (angular) orientation for spatial alignments and averaging across multiple cells (for seeing through the fog of noise in particular). I therefore decided to highly prioritize accuracy in segmented cell shape at the cost of “missing” some cell-frames (i.e. biasing towards type II-like errors—dropping a correct cell outline—for the sake of minimizing type I-like errors—wrongly accepting an incorrect outline), which could be compensated for by just averaging over even more data. I’d need to develop a fast algorithm, preferably analyzing each cell-frame’s cell outline just once, in order to capture enough data for the high degree of averaging I’d be doing.


The Algorithm:

I decided on a snake-like algorithm that would start from the most recognizably cell-edge-like pixel around a cell center and progress one circumference along the cell edge pixel-by-pixel. The direction of travel by the edge finder was guided orthogonally by both the thylakoid epifluorescence and phase contrast image gradients (following directions in which the direction’s cross product with each gradient had high magnitude and opposite sign) and also by the previous directions traveled (allowing movements in directions for which the dot product of current direction and average previous direction are non-negative). In this manner, the algorithm would analyze only a local window of pixels around each cell’s edge only once, and it would terminate once it formed a closed loop of reasonable perimeter and eccentricity with centroid sufficiently similar to the last capture (as quality control to maintain accurate cell outlines). Each overall frame was first thresholded (with binary openings to reduce noise) in each image channel to best find cell centers and define reasonable realms in which the snakes could roam (for instance, not allowing them to stray off too far from the cell clusters). I tuned hyperparameters for each species of cyanobacteria (one rod-shaped, one approximately spherical) to best apply strict quality control to acceptable cell outlines that were being tracked and analyzed.


Each cell was assigned a uniquely identifying number and tracked over its lifetime from first detection to division into two separately segmentable cells. The cells were physically held in place by their being sandwiched between the imaging glass (slide) and a nutrient-containing agarose hydrogel to prevent drift, and the algorithm would remember each cells’ centroid for continued tracking and update said centroid with subsequent frame in case of drift (which did occur—in sudden, earthquake-like fashion, though maximal drift only reached 1-2 pixels per frame—as the agarose hydrogel slowly dried over the 24+ hours of imaging).


Data (such as calculated centroid, angular orientation, eccentricity, length, width, etc.) on each cell in each frame was saved in a large array, and a local snapshot of the cell centered at the cell’s centroid (in both phase contrast and epifluorescence) was saved to a directory particular to each cell. After the full segmentation run was completed, each cell’s course of division could be viewed in video format by stitching together that cell’s snapshot (in either phase contrast or epifluorescence) in each frame, along with an optional temporal Gaussian blur (applied over a few consecutive frames with a stride of 1) for image clarity. Furthermore, a composite video was created for each image type (phase contrast or epifluorescence) and species showing an average time course of division for that species by spatially rotating each cell’s images by their orientation at that time and temporally aligning each cell’s snapshots in reference to the frame of its completion of division.


A Few Example Images:

This slideshow requires JavaScript.



Aside from the aforementioned finding that one species’ thylakoid division process was orderly (appearing as a very clear, non-dynamic structure that would split along the division plane partway through cell division) and the other was disorderly (appearing as a highly dynamic set of constantly forming and breaking thylakoid connections between roaming thylakoid regions that seemed to make way for the dividing cell membranes on their own just through constantly rearranging), a few other results from the field were recapitulated. Cell division was found to be loosely synchronized in one species as previously reported and completely asynchronous in the other (also as previously reported).


Example of High-Throughput Analysis:

Master's Thesis Synechococcus data.png

Due to the significant amount of noise, analysis was only possible using high-throughput and highly averaged approaches (necessitating the image segmentation algorithm). Over one hundred distinct cells’ data (colored traces in the background) were averaged over their lifetimes (black trace in the foreground) to generate insights. Here we see potential evidence of thylakoid membrane division in advance of the completion of cell division in Synechococcus, supporting Model 1 for this species.



Overall, this process gave me a solid experience-based foundation in methods of image segmentation and image analysis generally, which I carry with me in my current work as a data scientist. According to a recent survey-based report by Kaggle, image data is the third-most common data type used in the field of data science, after relational data and textual data, and comprises an estimated one sixth of data science work (including video data). Based on personal experience, it also seems to be relatively ignored among data scientists in training, who generally stick to the less esoteric realm of relational and textual data. It’s given me a greater perspective on the overall process of finding kernels of truth within seas of information as well as a few translatable skills that I can cross-apply to data analysis tasks in general.

All About the Digits: Approaches to the Canonical MNIST Dataset

Links to code on Github: and


Executive Summary:

I’ve created a set of functions that can pre-identify some features in the MNIST data set that one would normally imagine a first or second hidden layer handling. Mainly, it can identify (with some limited success) distinct straight line segments. Further steps will include feeding this information into a neural network to assess whether it improves classification accuracy or not, and improving the feature detection as well.



Neural networks, especially those of the convolutional or capsule variety, have an astounding ability to recognize and process images. However, they come with one major drawback: the reliance on an enormous volume of data for training. The amount of data and training they require is often a barrier to developing a competent network at a sufficiently difficult task (in terms of access to enough data or access to enough computing power). This need is beyond that of other machine learning approaches, and well beyond what seems necessary for learning in their distant biological cousins. For instance, I let you study one example of one Chinese character closely for some time (a few minutes should suffice), even with no prior knowledge of Chinese, you would be able to discern it against other Chinese characters. Furthermore, you would nearly as easily recognize it in scale- and rotationally variant forms. As deeper neural networks (which have traditionally been the direction for improving computer vision performance) explore increasingly astronomic possibility spaces (in which it’s hard to imagine converging on a workable solution), I can’t help but wonder if there are any sensible hard-coded features that are worth adding into neural networks to allow shallower neural networks (and accordingly less data and less processing) to achieve similar performance, as if by acting as a replacement solution for one or more lower layers. Such an approach would also help to make the resulting neural network systems less of a black box. Below, as an ongoing project, I’ve catalogued some explorations into possible features that may aid in classification of the MNIST handwritten digits, with the purpose of further improving my understanding of image techniques and as a means to explore neural networks.


The feature detection works by calculating basic image gradients at each pixel, arbitrarily assigning a direction orthogonal to the gradient at each pixel (always a ninety-degree rotation in the same direction), clustering the pixels into discrete groups based on pixel alignment with their neighbors, and systematically flipping the polarity of their directions until optimally “laminar” flow is achieved throughout the whole system (until each pixel group is most similarly aligned with its neighbor groups). These groups, when further clustered by similarity in alignment, outline distinct line segments.


Original image:

MNIST 4-1 raw.png

Angle based on gradient direction:

MNIST 4-1 gradient.png

Clustered into small groups:

MNIST 4-1 semi-joined.png

Joined into segments:

MNIST 4-1 joined.png


Below are two examples for each digit. Results are unfortunately still inconsistent for curvy digits (0, 2, 3, 5, 6, 8, 9), though fairly consistent for straighter-line digits (1, 4, and 7).