In this recruiting competition, Facebook provided anonymized data about users of an auction site. The task was to build a model that determined which users were humans and which were robots. Human users of an auction site do not want to compete with machines, and a site overrun with robots can see an exodus of its human users, so a robust system for detecting suspicious users is important.
According to the administrators, the data did not come from Facebook itself; I suspect their interest was in creating an environment similar to their advertising auction system.
This competition became quite special for me: it was my first Top 10 finish on Kaggle, and with it I became a Kaggle Master. That had been my first goal, but it also came with the added bonus of being considered for an interview for a Data Scientist position at Facebook.
One of the most interesting parts of this competition, and what made it very close to a real data science case, was that we were not given a raw file of variables related to the target that could be fed directly into a machine learning model.
Two files were available: the first contained about 2,000 rows of anonymized user information, such as ID and address. The second contained more than 7 million bids made on the site.
The variables in the second file included: bid ID, user ID, auction ID, auction category, user device, time, country, IP, and URL. The time was coded, so the exact dates of the bids could not be recovered.
Having a robust validation environment is essential for any machine learning task. Here the challenge was having only 2,000 examples for training and validation. After testing several combinations, I decided to use stratified cross-validation with 5 folds, repeated 5 times.
The repetition shuffles the examples into different folds on each run, which helps stabilize the error estimate, since no single example has an outsized influence on the mean.
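As a minimal sketch, this setup maps directly onto scikit-learn's `RepeatedStratifiedKFold` (the toy labels below are illustrative, not the competition data):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy imbalanced labels standing in for the real ones: 1 = robot, 0 = human
y = np.array([0] * 95 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)

# 5 stratified folds, repeated 5 times with different shuffles = 25 splits
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
splits = list(cv.split(X, y))

# Stratification keeps the robot proportion the same in every test fold
print(len(splits))  # 25 train/test splits to average the metric over
```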
The most important part of any data science job is making sure you have the right data for the prediction task. A very common mistake is to think that you can just feed the data to the algorithm and it will learn whatever you want. This was by far the most important part of my solution.
The first step was to create variables counting how many bids each user made. Extending this idea, I created variables counting how many different devices, IPs, categories, countries, and URLs the same user had used. With these simple variables, a Random Forest without any tuning reached 0.86 AUC in 5-fold cross-validation.
In the case of countries, in addition to the variable counting the number of distinct countries, I created an individual variable for each country, counting how many bids the user made from it. This helps the model identify “risky countries,” where it may be easier to host robots that bid.
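A sketch of how these count features might be built with pandas (the column names are my assumptions about the bids file, not the exact originals):

```python
import pandas as pd

# Toy bids table with the kinds of columns described in the text
bids = pd.DataFrame({
    "bidder_id": ["a", "a", "a", "b", "b"],
    "auction":   ["x", "x", "y", "x", "z"],
    "device":    ["d1", "d2", "d1", "d1", "d1"],
    "country":   ["us", "us", "ru", "br", "br"],
    "ip":        ["1.1", "1.2", "1.3", "2.1", "2.1"],
    "url":       ["u1", "u1", "u2", "u1", "u3"],
})

# Total bids per user plus the number of distinct values per column
feats = bids.groupby("bidder_id").agg(
    n_bids=("auction", "size"),
    n_auctions=("auction", "nunique"),
    n_devices=("device", "nunique"),
    n_countries=("country", "nunique"),
    n_ips=("ip", "nunique"),
    n_urls=("url", "nunique"),
)

# Per-country bid counts as individual columns (one column per country)
country_counts = pd.crosstab(bids["bidder_id"], bids["country"])
feats = feats.join(country_counts)
print(feats)
```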
After seeing that the counts had great predictive value, I decided to test simple statistics, such as averages and deviations. Examples of important variables:
– Average and standard deviation of bids per auction;
– Maximum number of bids given in the same auction, with the same timestamp;
– Average time between bids.
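These statistics reduce to pandas groupby aggregations; a sketch (the `time` values and column names are illustrative, since the real timestamps were coded):

```python
import pandas as pd

# Toy bids with a coded integer timestamp (column names are assumptions)
bids = pd.DataFrame({
    "bidder_id": ["a", "a", "a", "a", "b", "b"],
    "auction":   ["x", "x", "y", "y", "x", "x"],
    "time":      [100, 130, 160, 160, 200, 260],
})

# Mean/std/max of bids per auction, computed per user
per_auction = bids.groupby(["bidder_id", "auction"]).size()
auction_stats = per_auction.groupby("bidder_id").agg(["mean", "std", "max"])

# Average time between a user's consecutive bids
bids = bids.sort_values("time")
gaps = bids.groupby("bidder_id")["time"].diff()
mean_gap = gaps.groupby(bids["bidder_id"]).mean()

# Maximum number of bids by the same user at an identical timestamp
same_ts = bids.groupby(["bidder_id", "time"]).size().groupby("bidder_id").max()
```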
The time variable in the original data was coded so that the exact dates of the bids could not be recovered. But after analyzing the data, I realized that even without knowing the exact date, it was possible to identify what appeared to be days, hours, and minutes.
This allowed me to create variables related to averages, deviations, proportions and counts based on units of time. Some important variables:
– Standard deviation of bids per day;
– Average bids per day;
– Average bids on the same timestamp;
– Time-of-day bid count;
– Average auctions that the user participated in the same day.
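Assuming the coded time has already been decomposed into day and hour units, these features again reduce to groupby aggregations (a hypothetical sketch, not the original code):

```python
import pandas as pd

# Toy bids where the coded time was already decoded into day/hour columns
bids = pd.DataFrame({
    "bidder_id": ["a", "a", "a", "b"],
    "day":       [1, 1, 2, 1],
    "hour":      [9, 9, 14, 9],
})

# Bids per (user, day): the mean and std across days become per-user features
per_day = bids.groupby(["bidder_id", "day"]).size()
day_stats = per_day.groupby("bidder_id").agg(["mean", "std"])

# Time-of-day bid counts: one column per hour of the day
hour_counts = pd.crosstab(bids["bidder_id"], bids["hour"])
```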
It seems that all the participants who finished at the top of the leaderboard (LB) managed to identify these time patterns.
Some proportions based on counts and time were also important. Some examples are:
– Maximum proportion of total bids in a day assigned to the user;
– Density of bids per hour of the day.
When your variables contain enough information about the event, model selection and parameter tuning become simpler. In this case, despite briefly testing linear models, I chose to focus on models based on ensembles of decision trees.
Random Forest
This was the first model I tested, and the one that proved most useful in this case.
I believe its superior performance came from it being a model geared toward variance reduction. Because we had little data to train on, it was important that the model could cope with extreme values and stabilize its predictions.
A Random Forest with 500 trees and parameters adjusted by cross-validation had 0.9112 AUC in the local measurement, and 0.9203 in the LB.
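A minimal reproduction of this setup with scikit-learn, on synthetic data since the engineered features are not shown here (only the 500 trees and the repeated CV follow the text; the tuned parameters are omitted):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic imbalanced stand-in for the ~2,000 engineered user rows
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95], random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0, n_jobs=-1)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)

# 25 AUC scores (5 folds x 5 repeats); the mean is the local estimate
scores = cross_val_score(rf, X, y, cv=cv, scoring="roc_auc")
print(scores.mean())
```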
Gradient Boosted Trees
This is one of the most widely used models in data science competitions because it offers good performance across many types of tasks. Not here, though: my best models were around 0.90–0.91 AUC on the LB despite reaching 0.94 in validation.
This raises a question: we have a decision-tree-based model that performs well (Random Forest), and in theory boosting should improve the score, so why didn't it here?
My answer: Random Forest trades a little bias for a large reduction in variance, making the model more rigid and stable, while GBT reduces bias at the cost of more variance, making it more sensitive to the noise in the data. Here we had fewer than 2,000 examples and a strong class imbalance. Combined, these produce a high-variance setting, and increasing the variance further leads to overfitting.
The solution would be to tune and regularize the GBT, but with so little data for validation it is hard to trust the results. So I decided to rely on the theoretical argument and keep working with Random Forest to get a more stable model.
The final touch of most of the best competition solutions is to make an ensemble with the best models.
Due to the small number of examples, it was hard to trust the parameters found through validation. So I decided to create an ensemble of Random Forests with randomized parameters: each one had its own seed and slightly different settings. This ensemble achieved an AUC of 0.92 both in the out-of-bag estimate and on the LB.
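A sketch of such a randomized-parameter ensemble (the parameter ranges here are hypothetical; the original choices are not specified):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

rng = np.random.RandomState(0)
preds = []
for seed in range(10):
    # Each forest gets its own seed and a few randomized parameters
    rf = RandomForestClassifier(
        n_estimators=200,
        max_features=rng.choice(["sqrt", "log2"]),
        min_samples_leaf=int(rng.randint(1, 5)),
        random_state=seed,
        n_jobs=-1,
    ).fit(X, y)
    preds.append(rf.predict_proba(X)[:, 1])

# Average the probability predictions across the randomized forests
ensemble_pred = np.mean(preds, axis=0)
```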
AdaBoost Random Forest
Despite the instability of the GBT, I decided to apply boosting to Random Forests. Perhaps applying boosting to a more stable model than a single decision tree would help, even with the small number of examples. And that is exactly what happened.
In the validation it obtained 0.925 of AUC, and 0.928 in LB.
There was still a gap between the validation score and the LB whenever I added new variables.
Stabilizing the Seeds
To stabilize the predictions, I decided to run the Boosted Random Forest model with several different seeds and average the results. This did not improve the LB score, but it was stable in cross-validation, which made me trust the model.
In the end, I used a Boosted Random Forest model with a set of attributes that reached 0.94 AUC in cross-validation, and that was essentially my score on the competition's final test set, which earned me 6th place.