An Basic Approach to Reinforcement Learning

Executive Summary:

Despite my lack of expertise with the topic, I used this approach to win 3rd place at a Trivago hackathon in NYC on November 11, 2017. The task was to improve Trivago in some way (an incredibly open objective), and most teams, including the one I pitched at led, focused on improving Trivago’s web portal and search listings to help grow Trivago’s market share. My idea was roughly to apply a basic reinforcement learning approach to change the static (non-learning) search result ranking algorithm into one that continuously improves its market segmentation based on user interaction with the site. My proposed algorithm ranks search results according to what has worked the best with demographically similar visitors in the past, along with enough randomness to continue learning (i.e. not get stuck in a local minimum). The Trivago judges mentioned that the general idea is one they’re already developing and that they liked my specific approach, which was a bit different than what they had considered thus far.

 

Motivations:

It is sensible to assume that everyone prefers better-rated hotels at better prices, but I’d expect some customer segments to be more price-sensitive than others. (Some people will always book the cheapest option; some will always book their favorite regardless of price; and some will carefully deliberate.) I think it’s also sensible to assume that this behavior will, to some extent, track with demographic information such as income. And further, since IP is a proxy for location, and location is a proxy for demographic information like income, Trivago and other similar companies should be able to use such information when ranking listings for users.

 

I created a multi-dimensional space (numpy’s ndarray being the natural choice) that keeps track of how price-sensitive and discount-sensitive demographic segments are: one dimension for income segments (log transformed), one dimension for age segments, and one dimension as a proof-of-concept placeholder for other segments like gender. A final dimension just divides the space up into price-sensitivity information and discount-sensitivity information, though I conceptualize it as a three- (or n-) dimensional space pointing to a two- (or m-) element vector. Such a system does not rely on particular values of m and n, which can be chosen based on what provides the most useful information (n being the number of demographic information types that are useful, and m being the number of sensitivity-to-hotel-aspect features that are useful). I’ll call this the “demographic preference space”.

 

I initialize a few parameters, such as the learning rate, the level of randomness, and a set of effect-size parameters that influence how much different types of actions (from greatest to least: creating an account, purchasing through express checkout, clicking on a hotel, or leaving without purchasing anything) affect the demographic preference space. While making a hotel purchase does directly indicate a user liking a particular hotel, there’s unfortunately no direct measure of a user’s dislike of a particular hotel, so I set the effect-size for leaving without purchasing anything as negative to “blame” all the hotels presented to the user a little bit for not being what the user wanted.

 

Each interaction changes the demographic preference space based on the learning rate and the effect size. When deciding which hotel listings to present to a new user, each hotel that corresponds to their searched terms gets scored based on how that hotel scores along each hotel aspect (price and discount) compared with the assumed preferences of the user based on the demographic preference space. In more detail, a local Gaussian average of price-sensitivity and discount-sensitivity are taken in the demographic preference space, centered at the user’s own demographic position. These sensitivities are passed through a hyperbolic tangent (tanh) function to provide reasonable bounds on their amplitudes. Each sensitivity (for instance, price-sensitivity) is multiplied against its relevant score from the hotel (for instance, price-score), and each of these terms is summed for a total score for the hotel for the user. As I also scaled hotel aspect scores (price scores and discount scores) from +1 to -1, low discount-sensitivity (-1) against low discount (-1) registers as a match, just as high discount-sensitivity (1) against high discount (1). A (relatively small, adjustable) random number is added to the score for each hotel to prevent the algorithm from getting stuck in a local minimum, and the hotels are sorted by score and displayed to the user accordingly.

 

The main proof of concept was showing that despite initial rankings being random, wealthier and less wealthy phantom users applying their preferences (toward nicer hotels and toward cheaper hotels, respectively) quickly (within a few hundred interactions) resulted in wealthier users being shown accordingly nice hotels and less wealthy users being shown accordingly cheaper hotels.

 

One major challenge of this competition was the lack of any user or hotel data, which I had to write functions to generate for this proof of concept. That, combined with the competition’s duration of only one day, prevented any more sophisticated approaches, though I had thought of a few. The algorithm as I submitted it does not make any outright assumptions about hotel preferability, instead just following what the users pick. Perhaps a hybrid approach could improve results. There were also plenty of other hotel aspects, most notably rating (and users’ sensitivities to these hotel aspects, e.g. rating-sensitivity), that could have easily been included in the demographic preference space. And despite using some nuts and bolts of neural networks (such as the idea of a learning rate, vaguely convolution-like processing of the demographic preference space, incrementation of weights based on fit/error assessment, and use of functions like tanh to prevent “blowing-up activation”), there was no neural network component to my solution despite being well-suited for such an approach. Had we more time, I would have pursued such low-hanging fruit as next steps.