Author Archives: searchingforbole

About searchingforbole

Exploring Data Science. Data Science bootcamp graduate. Most recently General Manager and Program Director for Sunrise International Education. M.S. in Biochemistry, Stanford, Dec 2013. B.S. in Biochemistry, Columbia, 2011.

Predicting House Prices During a Declining Economy: A First Look into Kaggle Competitions

Executive summary:

I used supervised and unsupervised machine learning algorithms—primarily Multiple Linear Regression, Principle Component Analysis, and Clustering—to accurately predict prices for Sberbank’s Russian Housing Market Kaggle competition. I developed these models using a data pipeline that cleaned the data based on my research findings, tidied the data into Third Normal Form, transformed features to appropriately fit the models used, engineered new features where appropriate, imputed missing data using K-Nearest Neighbors. I then used the Bayes Information Criterion and residual plots to identify important and sensible underlying factors that affect housing prices, and I created a predictive model with validation.



My first glance into the world of Kaggle competitions was an interesting one: international sanctions, a collapsing oil economy, a nascent coffee culture, and tax fraud all contributed significantly to a proper understanding of Sberbank’s Moscow housing market dataset. As a business-facing problem, successful analysis of such a dataset must include two main components: insights brought forth by interpretation of the data, and accurate predictions brought forth by the best model. Below I present how to go about this tall task and a review of the major factors that impact the nominal price of housing in Moscow.


The objective of this competition was to accurately predict prices of housing units in Moscow for Sberbank given the data it provided on Kaggle. This included a set of macroeconomic data from the years 2010-2016 (overlapping Russia’s conflict in Crimea and its international response of increasing sanctions, along with the collapse of the Russian oil economy that followed), and a set of housing unit data from the same period, with prices for the period from 2010 to April 2015, and an unpriced test set from April 2015 onward used for model scoring and ranking.


Aside from typical issues with missingness and inaccuracy that one expects in any real-world dataset, first attempts at modeling the data performed unsatisfactorily due to an insidious issue with the quality of the data: a predominance of uniformly cheaply-priced units in the far left tail of the price histogram. See below:

This slideshow requires JavaScript.


Such effects were further compounded by the fact that the Support Vector Machine I constructed failed to classify which units might end up selling at such a “subsidized” price and which would sell at prices within the typical distribution for Moscow houses. This vexing class of housing units ended up having a much simpler explanation after I briefly looked into Russian capital gains law: tax fraud. It’s apparently common practice to report significantly lower house prices for the purposes of property tax evasion, so I assigned all suspicious prices (the glut of prices clustered at or just below the RUB 1 million and 2 million property tax cut-offs) to missing and imputed instead, which greatly improved the accuracy of the model. Sometimes the answer to a data conundrum comes from outside the data.


Additional preliminary looks into the data revealed that Sberbank would strongly benefit from a data engineering team. The dataset Sberbank provided was primarily composed of highly redundant features slapped together into an inconsistent and amorphous blob that violated basic Tidy Data principles in multiple ways. In light of this, I developed a data cleaning and tidying pipeline that was key in my team’s success in the competition. Here are some of the ways I confronted these issues:


I set out to build an interpretable multiple linear regression model with the goal of providing useful insights into the Moscow housing market (as opposed to using a more powerful black-box model). I constructed this model using features engineered in three ways: native features transformed to avoid violating requirements for use in a linear regression model (e.g. linearity, homoscedasticity, and a normal-like distribution), composite features generated to avoid issues associated with multilinear regressions (i.e. PCA to resolve collinearity), and novel features engineered to better relate a feature’s effect on price (e.g. thresholding and further transformations).


Prices were distributed nearly log-normally (a Box-Cox transformation showed best-fit lambda close to 0), so I log-transformed price figures. Other features (such as apartment size) showed much-closer-to linear fit upon log transformation as well (along with much-closer-to normally distributed errors), so for best incorporation into the multilinear regression, I log-transformed those features as well. These log-log relationships also displayed much lower heteroskedasticity compared to those of the untransformed features, further necessitating the transformation. Other features, particularly temporal economic figures, required separate modeling, as they were duplicated with differing frequencies (e.g. Sberbank copy-pasted weekly-measured figures for each other weekday and copy-pasted monthly-measured features for the rest of the month so that one independent measurement masqueraded as multiple separate observations). See an example of the heteroskedasticity below:

This slideshow requires JavaScript.


Matrix correlation plots (below) revealed that the dataset consisted of two sets of mostly highly correlated data. The first set contained many of the most explanatory features, so I selected the most useful of these for use in the model and left the rest out. I reduced dimensionality in the second set through PCA and found a handful of useful features (principle components) that I also added to the model. After investigating the significant principal components (left of the “elbow” in a scree plot) for interpretability in addition to significance, I also included the top 10 PCs from the set of distance features and the top 4 PCs from the set of coffee-related and object count features.

Kaggle correlations.png


I explored reducing the complexity of the raion feature using agglomerative clustering, though the lackluster results of such explorations (in addition to the failure of raion characteristics to model raion residuals of the best model missing the raion feature against the true values) further strengthened my sense that the coefficients of the raion categoricals are more a measure of neighborhood popularity (je ne sais quoi) than anything else. It is reasonable to suppose that factors outside (and unmeasured by) the dataset would also be affecting prices; demographic and cultural information for each raion was limited, and such effects would effectively be captured by a catch-all feature like the raion categorical itself.

So what factors do Muscovites react strongly to when pricing a housing unit, and in what ways?


Muscovites like:

  • larger units (by far the biggest contributor to price)
  • desirable neighborhoods (the second-largest contributor)
  • units in better condition
  • living in taller buildings
  • living on higher floors within those buildings
  • expensive coffee in the center city (hipsterism?)
  • proportionally larger kitchens
  • living closer to parks
  • living within walkable distance of a metro station
  • living near big shopping areas


Muscovites don’t like:

  • living too far away from the city center (another major contributor)
  • living right by highways
  • living right by railroads
  • living right by power transmission lines
  • living right by oil refineries
  • panel or breezeblock construction materials
  • old buildings
  • buildings with contemporary-style architecture, regardless of age


Additionally, they’re willing to pay more for ownership-style apartments than investment-style apartments (or house-buyers may be less savvy than real estate investors).



While Kaggle-style competitions tend to reward black box models, kernel-copying, and hyperparameter-hacking through repeated submissions (submitting models fudged by different amounts until the score happens to improve as a kind of over-fitting), I took it as a way to learn how to better perform regular data science, using only the kinds of models and techniques that I could justify to a supervisor looking for insights. It was outperformed by boosted-tree methods in terms of log-error, but held its own very well against less flexible models (being the best multilinear regression among my bootcamp cohort) and solidly accomplished its objectives of providing actionable insights into how potential house buyers in Moscow make pricing decisions.

“Fake News”: Detecting Bias on Wikipedia using NLP

Executive Summary:

My team partner (Rachel Kogan, a NYC Data Science Academy bootcamp classmate) and I filtered the Wikipedia English corpus down to bags of words using pySpark and trained machine learning models to detect bias (point of view tags) on TF-IDF data with nearly 90% accuracy.



The tide of news about fake news has reached a crescendo, and not many feasible solutions have been proposed for controlling the spread of biased information. The push for Facebook and other large social media sites on which most fake news seems to promulgate to manually editorialize their content is certainly infeasible for one. This problem seems ripe for machine learning approaches, specifically Natural Language Processing, so we set out to develop models that could detect bias in a large textual corpus. Due to its ease of access and pre-labeled content, we trained our models on the English Wikipedia corpus.


Wikipedia is one of the topmost visited websites in the English-speaking world and serves as a first go-to when looking for a quick overview on almost any topic. It’s surprisingly accurate and reliable given its anyone-can-edit ideology and site structure (roughly as reliable as the Encyclopedia Britannica according to a 2005 Nature study). Bots and users alike peruse the site for vandalism and other problematic edits, marking certain sentences, paragraphs, and entire articles as having a problem from a specific set of pre-defined tags. Fortunately for us, POV, or point of view (implying non-neutral point of view) is one such tag. There are several thousand articles in the English language Wikipedia with current POV tags, which provided a substantial enough body of data with which to train a bias detector. We downloaded the entire current-version (as of May 2017) English Wikipedia corpus in XML and stored it in the cloud using Amazon’s EC2 and S3 services. After separating the massive (~64GB) textual dataset into separate pages by the embedded XML page tags, we were able to filter the pages down to the usable articles (removing redirect pages, talk and other user pages, and simple list pages). The POV tags within each page’s text follow a replicable format (“{{POV … }}”), allowing for a regex-like filtering of the total articles down to just the articles with POV issues, and furthermore, articles for which the entire body of text is tagged as having POV issues (as opposed to just a paragraph or single line) also show a replicable special POV tag located in the header of the text, making filtering out articles with only very minor POV issues simple.


We then sampled from the rest of the corpus (presumably without POV issues) to generate training and test data sets with roughly equal proportions of non-neutral and relatively neutral articles. The XML markdown-style documents were then parsed with a set of regular expressions to generate a simple bag of words for each article (thereby removing all tags, including those that mark the article as having POV issues). We tokenized the bags of words and removed stop-words, followed by generating TF-IDF values for each unique term in each document (Term Frequency Inverse Document Frequency is a canonical way to calibrate simple bag of words term counts into relative term importances by adjusting by the rarity of each term across the entire corpus).


We trained many kinds of models to analyze which may be best at detecting bias in Wikipedia articles (and, by extension, in other sources of textual data as well). Among the best performers were logistic regression, random forest, and XGBoost (see metrics below).

Wikipedia model comparison.png


Additionally, we found that our models were fairly confident in their predictions, giving further credence to the idea that something as complicated as bias can be well handled by NLP and machine learning methods (x-axis is predicted probability, and y-axis is the number of such observations):

Wikipedia validation set predicted probabilities by model.png


We further investigated correlations in misclassification between these models to see if an ensembled approach would significantly improve our results. However, high correlations between our models limited the extent to which we could improve our overall misclassification rate through ensembling. As you can see below, any majority-voting metric would have misclassified at nearly the same rate.

Wikipedia misclassification.png

It’s All Greek to Me: Creating My Own Regex Writer

Link to the code on Github: utils_regex.R


Executive Summary:

I developed a library of trivial but useful regex-writing functions that make normally painful expressions faster to write and easier to read. I expanded the suite of typical regex functions to include others I wished had existed all along, mostly for reducing all the boilerplate code that comes along with certain types of expressions. I like using these functions because they make writing regex faster, reading easier, and debugging much simpler.



Regular expressions often look like chicken scratch to programmers who didn’t write those specific expressions themselves. After working with them frequently, I find them relatively straightforward to write but still unfortunately painful to read and understand. I created this suite of functions that build up regular expressions in easy-to-understand blocks so that other programmers who look at my code (including future-me) can easily understand what and how I was getting at with these expressions.


To start, why is there no simple regex remover function? Sure, you can write re.sub with repl equal to the empty string (gsub(replacement = “”) for the R programmers), but why all the boilerplate? Also, why are the patterns always written first, when the strings it will act on (especially given R’s piper) would make more sense? Well…


rem(strings, pattern, …) is a single substitution with an empty string. grem is the gsub version of that.


If I want to remove multiple things or do multiple substitutions from/on a list/vector of strings, do I really have to chain the expressions together (re.sub(re.sub(re.sub(re.sub(to infinity and beyond!)))) until the stack overflows? Or worse yet, copy-paste nearly the same line many times in a row with a new or identical variable name each time? Nope.


grems(), subs(), gsubs(), greps(), grepls(), regexprs(), and gregexprs() (the “s” is just indicating the plural form) do exactly that, but with a built in for loop to further reduce boilerplate your eyes don’t need when you’re already looking at regex. subs() and gsubs() have the added benefit of using a single named vector in R, so “USA” = “United States” would turn “United States” into “USA”. If you’re staring with two separate vectors, just rename the patterns with the replacements.


Do you have a set/list/vector of expressions you’d all like to test simultaneously? Just wrap it inside any_of(), which will make the “(x|y|z)”-like construction for you. It’s most useful if you have multiple nested or-bars.


Does finding a word need to be as ugly as “\\bword\\b”? I’ve lost count of the number of times I or an error message has caught myself having written “\\bob\\b” when I mean “\\bbob\\b” (the word bob), for instance. word(“bob”) does that.


If you’re removing certain words, you’ll often end with hanging punctuation that’s painful to remove. Why not combine all that into one step?


Removing everything that occurs before or after (but not including) some highly repetitive set of characters can sometimes cause catastrophic backtracking and other related problems, so I’ve also created some functions that make that same process easier and faster (by providing a few better, proper lines to avoid the one-line sub you’re/I’m liable to write on a deadline) while keeping a clean, unintrusive appearance.