Links to code on Github: code transcriber and resulting Python code
Executive Summary:
I wrote code that transcribes my (native) R lemmatizer into Python for wider use. As the lemmatizer performs the same few basic operations many times, I was able to teach the transcriber how to transcribe each type of statement into idiomatic (Pythonic) Python, making it usable, readable, and efficient in both languages. In this way I’ve halved my need to update and debug code, saving myself much time and many headaches.
Motivations:
After developing my lemmatizer in R, I wanted to use it in conjunction with other NLP packages in Python. Keeping in mind that I’d often make small changes in my lemmatizer code in R, I wanted some way to easily and safely both generate and maintain the Python cousin of the same set of code. Working on code in multiple languages is a part of many team projects, and I wanted to get into the habit of never duplicating myself, as writing the same programing idea in two languages from scratch is a waste of resources.
The idea is simple to conceive but (as I found) somewhat difficult to thoroughly execute, especially defensively to unforeseen changes. Some language aspects (R’s use of brackets and default two-space indentation vs Python’s use of colons and four-space indentation) are relatively straight-forward to fix with a few well-tuned regular expressions (regex). Other aspects are much more difficult to properly handle: among these are Python’s basis on lists vs R’s basis on vectors, handling missingness across two languages that differ widely in how missingness is encoded, translating carefully-tuned string operations into Pythonic Python (both for the sake of readability and performance), and transcribing statements written over more than one line. I’ll address these in order below.
Python’s lists allow a kind of freedom that R’s vectors don’t have, such as handling multiple data types and also allowing nesting (similar to R’s lists in these respects). The better analogue to R’s vectors in Python is numpy’s arrays, and I naturally made extensive use of these for efficient vectorized calculations on input words. However, for convenience to the user (including myself) I forced myself to accept the most common structure as input, which for Python is still the list, handled with one simple customized list comprehension ensuring type safety and flatness. R makes extensive use of its combine c() method for creating vectors, which has the unique property of ensuring flatness. I had to use a few other customized bits of code to ensure flatness after some of the vector combinations I had written in R.
Numpy strings do not natively support missingness. Wherein my R code made use of boolean missingness to denote values neither true nor false (e.g., some words like sheep are neither plural nor singular), I had to work around this by simply considering things singular if they do not need to be made singular for the lemmatizer to work. It would be possible to keep a separate boolean numpy array for missingness, but that seemed excessive when a simple solution such as the above worked fine (after all, the complicated solution would not be very Pythonic).
Translating idiomatic R (such as that with extensive use of infix operators) into idiomatic Python (list comprehensions, tuples, and such) required a more fine-tuned approach, and such transcriptions needed to be baked into the transcriber due to their extensive use. I translated most of R’s paste() statements into list comprehensions with simple string concatenation. I found that while Python strings conveniently support using a tuple as the argument in the endswith() method (to check if the string ends with multiple endings), numpy.chararray only supports single strings (and numpy.ndarray does not support the endswith() method at all), also necessitating the creation of more list comprehensions. I had to take many things into account when moving arguments around (e.g. from either infix operator position in R to the object-like position in Python), including parentheses, operators and their order of operations, and non-coding character strings (either in literal, quoted strings or comments).
Additionally, I taught the transcriber how to evaluate arguments on separate lines to handle my longer, complicated vectorized logical computations. Adding the trailing backspace for Python was easy; identifying whole argument chunks and moving them around appropriately was not.
After running the code transcriber, the output Python code only requires a handful of manual changes—which can be copy-pasted identically each time as they are the most fundamental and language-dependent processes and least likely to change—to function identically to its R counterpart.
Future developments could further reduce the need for the remaining few manual changes, though some are likely infeasible (or prohibitively burdensome) to handle in Python (e.g. non-standard evaluation, which is very idiomatic and unique to R). The pluralization functions are readily transcribable as well once I remove their dependency on non-standard evaluation, which is a convenient but not essential aspect of their function in R. I’d also like to further improve the readability of the code in ways that don’t significantly impact performance, and plan to explore subclassing numpy.chararray to clear up some boilerplate verbiage among the methods (especially when it comes to testing element-wise if a string matches any of a few targets).