Airbnb is a technology company that runs an online marketplace where users can book accommodations around the world and also advertise their own properties to the traveler community.
Looking for candidates for its data science team, the company sponsored a competition whose goal was to predict the first country in which a new user would make a reservation.
The most interesting part of this competition, for me, is that it was possible to reach a good position (Top 7%) with a single model that could be put into production right away.
About the Data
Anonymized data about the user profile were made available, such as age, gender, account creation date, language, and the channel used to reach the site. These data went back to 2010.
In addition, there were data about user sessions, keyed by user id, describing the action performed (a click, a message to the property owner, a view of search results, etc.) and how many seconds passed between that action and the previous one. These data existed only from January 2014 onward.
There were two other files, with data on the characteristics of the countries, but I found no use for them.
The metric used was NDCG (Normalized Discounted Cumulative Gain). It ranges from 0 to 1 and measures the relevance of the model's results when sorted by their predicted scores. More information is available at this link.
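For a single user, the competition's variant of NDCG is simple: relevance is 1 for the true destination and 0 for everything else, so the ideal DCG is 1 and the score reduces to a discount on the rank at which the true country appears. A minimal sketch (`ndcg_at_k` is my own illustrative helper, not competition code):

```python
import math

def ndcg_at_k(ranked_countries, true_country, k=5):
    """NDCG for one user: relevance is 1 only for the true
    destination, so the ideal DCG is 1 and the score is just
    the positional discount of the correct guess."""
    for i, country in enumerate(ranked_countries[:k]):
        if country == true_country:
            return 1.0 / math.log2(i + 2)  # rank 0 -> 1.0, rank 1 -> ~0.63
    return 0.0  # true country not in the top k
```

Placing the correct country first scores 1.0; placing it second already drops the score to about 0.63, which is why getting the top prediction right matters so much.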
A user could choose between 12 countries, but almost 90% of them went to the United States or did not make a reservation.
It soon became clear, given the metric used, that it would be better to focus on modeling whether or not the user makes a reservation at all, leaving the specific destination in the background.
The data provided for training and validation covered users who created their accounts before July 1, 2014, while the leaderboard test data covered the three months after this date.
Several participants used cross-validation with randomly shuffled examples, but because of the time needed to run the model during development, and because the characteristics of the data depended on time, I decided to use the May and June 2014 data as validation.
Some characteristics of the data, such as the proportion of users who had session data, changed over time, so it made sense to validate on a period quite close to the test set. This split proved quite robust.
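The time-based split described above can be sketched as a simple date cutoff. The column name `date_account_created` matches the competition's train file; the toy rows here are illustrative only:

```python
import pandas as pd

# Toy stand-in for the users table; the real one is read from CSV.
users = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "date_account_created": pd.to_datetime(
        ["2013-11-02", "2014-03-15", "2014-05-20", "2014-06-30"]),
})

# Everything before May 2014 trains the model;
# May and June 2014 serve as the validation period.
cutoff = pd.Timestamp("2014-05-01")
train = users[users["date_account_created"] < cutoff]
valid = users[users["date_account_created"] >= cutoff]
```

Because the validation window sits right before the test period, it mimics the leaderboard setup far better than a random shuffle would.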
I used the standard user profile variables described above. Since I planned to use a model based on decision trees, I encoded each categorical variable as ordinal integers; these models can find patterns in such codes even though the variables have no real order.
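A minimal sketch of that ordinal encoding with pandas, assuming hypothetical column names; `pd.factorize` simply assigns an integer code to each distinct value in order of appearance:

```python
import pandas as pd

# Illustrative categorical columns from a user-profile table.
df = pd.DataFrame({
    "gender": ["FEMALE", "MALE", "-unknown-", "MALE"],
    "language": ["en", "fr", "en", "de"],
})

# Replace each category with an arbitrary but consistent integer code.
for col in ["gender", "language"]:
    df[col] = pd.factorize(df[col])[0]
```

The codes carry no real order, but tree splits such as `gender <= 0` can still isolate individual categories.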
In addition, I computed the amount of time each user spent between sessions and how many distinct types of actions they performed on the site.
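These per-user session aggregates can be sketched with a groupby; the column names (`user_id`, `action`, `secs_elapsed`) follow the sessions file, while the toy rows are made up for illustration:

```python
import pandas as pd

# Toy sessions log: one row per user action.
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "action": ["search", "click", "search", "message"],
    "secs_elapsed": [30.0, 120.0, 60.0, 10.0],
})

# Per user: total time between actions and number of distinct action types.
user_feats = sessions.groupby("user_id").agg(
    total_secs=("secs_elapsed", "sum"),
    n_action_types=("action", "nunique"),
)
```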
I extracted basic date information, creating individual columns for the day, month, year, and day of the week. This helps the model capture seasonal effects of the time series.
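Extracting those date parts is straightforward with the pandas `.dt` accessor; this is a sketch on two example dates:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2014-01-15", "2014-06-30"]))

# One column per date component; weekday is 0 for Monday.
df = pd.DataFrame({
    "day": dates.dt.day,
    "month": dates.dt.month,
    "year": dates.dt.year,
    "weekday": dates.dt.dayofweek,
})
```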
Session features are an essential part of most e-commerce models. In this specific case, I created a column with the amount of time the user spent on each action performed on the site, and another set of columns counting how many times the user executed each action.
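A sketch of those per-action columns using `pivot_table`, again with made-up rows and the sessions file's column names assumed:

```python
import pandas as pd

sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u1"],
    "action": ["search", "click", "search", "search"],
    "secs_elapsed": [30.0, 120.0, 10.0, 60.0],
})

# One column per action holding the total time spent on it...
secs_per_action = sessions.pivot_table(
    index="user_id", columns="action",
    values="secs_elapsed", aggfunc="sum", fill_value=0)

# ...and a parallel set of columns counting the occurrences.
counts_per_action = sessions.pivot_table(
    index="user_id", columns="action",
    values="secs_elapsed", aggfunc="count", fill_value=0)
```

With hundreds of distinct actions this is exactly what produces a wide, mostly-zero feature matrix like the one described below.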
Machine Learning Model
I used a Random Forest with 300 trees, which was good enough to win a place in the Top 7% (91st of 1463).
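In scikit-learn terms, the core of such a model is a few lines; the random features and target here are stand-ins for the real ~1,100-column matrix:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix and target.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = (X[:, 0] > 0.5).astype(int)  # e.g. "booked" vs "no booking"

# 300 trees, as in the final model; n_jobs=-1 uses all cores.
model = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
model.fit(X, y)

# Class probabilities provide the scores used to rank destinations.
proba = model.predict_proba(X)
```

For the competition submission, each user's destinations would be sorted by these predicted probabilities before computing NDCG.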
In total, I had about 1,100 columns, most of them quite sparse (most of the values were zero). Some people say that models based on decision trees do not do well with this data format, but my experience suggests it is a matter of finding the right parameters.
Unfortunately, I did not have time to train a good Gradient Boosted Trees model, which would very likely have performed better; given the small score difference between my Random Forest and the top of the leaderboard, it would probably have been enough to finish in the Top 10.