Using Machine Learning To Predict Which User Will Click An Ad

Avazu is an advertising platform that delivers ads on websites and mobile applications. One of the biggest challenges in the industry is determining which ad is most relevant to the user. If an ad matches the user’s interests, the chance of a click is higher, which increases the profits of both the platform and the advertiser, who gains a potential customer visiting their site.

Machine learning has been used by several companies that deliver ads in a pay-per-click format, in which the advertiser pays for each click on the ad. One of its applications is building models that estimate the likelihood of a user clicking on an ad, based on:

Profile information: location, device used;
Advertiser information: ad category, size, and color;

And also data about the site or app in which the ad is displayed.

In this repository I make available the original implementation of FTRL Proximal by Kaggle user tinrtgu, based on the paper published by Google, as well as the modified code I used to obtain my best single model.

Data Description

In this competition, Avazu made available a dataset containing 10 days of impression and click logs, with information about users, advertisers, and display platforms. The goal was to predict the likelihood of a visitor, on an app or site, clicking on the displayed ad.

The day following the training period was held out as the test set to evaluate the model. The training set had about 40 million rows, and the test set about 4 million. Furthermore, it was a high-dimensional problem, with more than 1 million independent variables.

The metric chosen for evaluation was log loss, which heavily penalizes predictions that are confident but wrong. We should keep in mind that there was an imbalance between the proportion of positive and negative examples, and this metric favors correctness in the most represented class (in this case, the negative ones).
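To make the metric concrete, here is a minimal sketch of binary log loss in plain Python. The labels and predictions are made up for illustration; only the formula itself is standard.

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [0, 0, 0, 0, 1]                 # made-up labels, mostly negative

constant_half = log_loss(labels, [0.5] * 5)   # a constant 50% guess gives ln(2)
base_rate = log_loss(labels, [0.2] * 5)       # predicting the click rate does better
confident_miss = log_loss([1], [0.01])        # confident and wrong: large penalty
```

On imbalanced data, a constant prediction equal to the observed click rate already beats the 50% guess, which is why the metric rewards getting the majority class right.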

Baseline model

Predicting a probability of 50% for all examples gave a log loss of 0.6931 (that is, ln 2). The first model I tested was a logistic regression with Vowpal Wabbit, without regularization and with only 1 pass over the data. It resulted in a log loss of 0.3993.

Cleaning the Data

Given the large number of features and values, many values occurred only rarely in the dataset, and this can hurt classification. Logistic regression is sensitive to outliers in the features, so removing them could improve performance.

I tested two strategies: removing the outliers, or grouping them under a single “rare” value. The choice that significantly improved log loss was to group the values of features that were present in fewer than 5 examples.
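The grouping step can be sketched as follows. This is an illustration rather than the code used in the competition: the function name, the `"rare"` token, and the toy rows are assumptions, and it keeps everything in memory, whereas the real dataset would need two streaming passes (one to count, one to rewrite).

```python
from collections import Counter

def group_rare_values(rows, min_count=5, rare_token="rare"):
    """Replace values seen in fewer than min_count rows with rare_token, per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            counts[(column, value)] += 1
    return [
        {c: (v if counts[(c, v)] >= min_count else rare_token) for c, v in row.items()}
        for row in rows
    ]

# Toy data: "a" appears 5 times, "b" only twice.
rows = [{"site_id": "a"}] * 5 + [{"site_id": "b"}] * 2
cleaned = group_rare_values(rows)
```

For the test set, the counts computed on the training data would be reused, so values never seen in training also fall into the “rare” group.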

The same logistic regression, without any optimization, applied to the “clean” dataset reached a log loss of 0.3954.

Validation Issues

In this competition one of the big problems was finding a suitable way to create a reliable validation set.

This kind of data changes rapidly: new variables come in and old ones stop being used within days or weeks, so it is naturally hard to agree on how well a model will perform in the future.

Several approaches were discussed in the competition forum, and two of them were widely used:

Progressive Holdout

In this approach, the model is trained on the first N examples and the log loss is computed on example N + 1, before that example is used for training. After going through the entire dataset, these scores are averaged. In my attempts this proved to be a rather optimistic method.
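A minimal sketch of this predict-then-train loop, using a tiny online logistic regression over binary features. The learner, learning rate, and toy stream are assumptions chosen for illustration; any online model could take its place.

```python
import math

def progressive_log_loss(stream, n_weights, learning_rate=0.1):
    """Score each example before training on it; return the mean log loss."""
    w = [0.0] * n_weights
    total, n = 0.0, 0
    for active, y in stream:                  # active: indices of features equal to 1
        z = sum(w[i] for i in active)
        p = 1.0 / (1.0 + math.exp(-max(min(z, 35.0), -35.0)))
        p_clipped = min(max(p, 1e-15), 1.0 - 1e-15)
        total += -math.log(p_clipped) if y == 1 else -math.log(1.0 - p_clipped)
        n += 1
        for i in active:                      # the SGD step happens only after scoring
            w[i] -= learning_rate * (p - y)
    return total / n

# Toy stream: feature 1 goes with clicks, feature 2 with non-clicks.
stream = [([0, 1], 1), ([0, 2], 0)] * 200
score = progressive_log_loss(stream, n_weights=3)
```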

Validation in the last days

Because this data has a time dependence, the model was trained with the last day, or the last two days, of the dataset excluded; those days were then used for validation. This alternative proved more reliable than the other, but it was not very robust to the choice of interactions between variables, so it was still necessary to be careful about overfitting in these cases.
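A sketch of such a split, assuming each row carries an `hour` string in YYMMDDHH form (so the first six characters identify the day). The function name and the sample rows are hypothetical.

```python
def split_by_last_days(rows, n_validation_days=1, hour_key="hour"):
    """Hold out the most recent n_validation_days as the validation set."""
    days = sorted({row[hour_key][:6] for row in rows})    # YYMMDD prefixes
    held_out = set(days[-n_validation_days:])
    train = [r for r in rows if r[hour_key][:6] not in held_out]
    valid = [r for r in rows if r[hour_key][:6] in held_out]
    return train, valid

rows = [{"hour": "14102100"}, {"hour": "14102223"}, {"hour": "14103005"}]
train, valid = split_by_last_days(rows, n_validation_days=1)
```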

Feature Engineering

All features of this competition were binary, and some were anonymous, so I had no information about what they represented.

I tried counting how many times each value of a variable appeared in the dataset and using this count as a new feature. It worsened performance for me, although it worked for other teams.

An alternative that showed an improvement, though not a significant one, was to create a variable indicating whether a given value had more or fewer than 100 occurrences.
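Both ideas, the raw count and the thresholded indicator, can be sketched in a few lines. The column suffixes and the choice of `>=` at the threshold are assumptions; the toy example uses a threshold of 3 instead of 100 to stay small.

```python
from collections import Counter

def add_count_features(rows, column, threshold=100):
    """Add the occurrence count of row[column] and a 0/1 'frequent' flag."""
    counts = Counter(row[column] for row in rows)
    out = []
    for row in rows:
        n = counts[row[column]]
        new_row = dict(row)
        new_row[column + "_count"] = n
        new_row[column + "_frequent"] = int(n >= threshold)
        out.append(new_row)
    return out

rows = [{"app_id": "x"}] * 3 + [{"app_id": "y"}]
featured = add_count_features(rows, "app_id", threshold=3)
```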

Interaction between variables

Another alternative was to create interactions between two or more variables. This was done simply by creating a variable that indicated the presence of combinations between values. At first I tried the interaction between all the pairs of variables, which worsened performance.

I made another attempt, this time manually creating combinations that seemed relevant to me (representing a user’s access to an app, for example). These eventually improved the model, reducing the log loss to 0.3889.
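An interaction of this kind is just a new categorical feature whose value is the combination of the original values. A minimal sketch, with column names and the separator chosen for illustration:

```python
def add_interaction(row, columns, sep="_x_"):
    """Return a copy of row with one extra feature combining the given columns."""
    new_row = dict(row)
    new_row[sep.join(columns)] = sep.join(str(row[c]) for c in columns)
    return new_row

row = {"device_id": "d1", "app_id": "a9", "banner_pos": "0"}
with_pair = add_interaction(row, ["device_id", "app_id"])
```

Because the combined value is again a string, it can go through the same hashing and one-hot treatment as any other feature. Creating every pair multiplies the number of distinct values quickly, which is one plausible reason the all-pairs attempt hurt.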

Models Trained

Logistic Regression with Vowpal Wabbit

I took this opportunity to learn about a new tool: Vowpal Wabbit, a fast implementation of machine learning algorithms that minimize convex functions for both regression and classification. As a reminder, the “pure” logistic regression, without data cleaning, had a log loss of 0.3993. After cleaning, this number dropped to 0.3954.

Calibrated SVM with Vowpal Wabbit

Since it was necessary to send a list of probabilities to Kaggle, I tried to use the distance of the data to the SVM hyperplane as inputs, both for a logistic regression (inspired by Platt’s scaling) and for an isotonic regression, available in scikit-learn.

These are two popular ways to calibrate an SVM so that it gives us probabilities. Neither showed a meaningful improvement in the score.
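The logistic variant can be sketched without any library: fit a one-dimensional logistic regression that maps the SVM margin to a probability. This is a simplified illustration, not the code used in the competition; Platt's original method also smooths the target labels, which is omitted here, and the margins and labels below are made up.

```python
import math

def fit_platt(scores, labels, learning_rate=0.1, n_iter=2000):
    """Fit p = sigmoid(a * score + b) by batch gradient descent on the log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += (p - y)
        a -= learning_rate * grad_a / n
        b -= learning_rate * grad_b / n
    return a, b

def calibrate(score, a, b):
    """Turn a raw margin into a probability using the fitted parameters."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Made-up signed distances to the hyperplane and the observed clicks.
margins = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
clicks = [0, 0, 0, 1, 0, 1, 1, 1]
a, b = fit_platt(margins, clicks)
```

The calibration parameters should be fitted on held-out data rather than on the examples the SVM was trained on, otherwise the margins are too optimistic.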

Follow the Regularized Proximal Leader (FTRL Proximal)

Google Paper – Ad Click Prediction

This was the algorithm that helped me greatly improve the log loss. It was implemented by one of the competitors and made available in the competition forum. Developed by Google for the same task of calculating ad click probability based on user information, it creates a sparser representation of the data and ends up being more robust against outliers. In the paper, the authors describe its implementation and characteristics compared to other algorithms used for the same task.

It is essentially a logistic regression with adjustments that reduce the memory needed to store the weights, combined with an optimization method that forces the least significant weights to be exactly zero. That is, in a problem with millions of variables like this one, it disregards those that are not really important and, in this way, automatically selects features.
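The per-coordinate update described in the paper can be sketched as below for binary features. This is my own compact illustration, not the code from the forum; the class name, hyperparameter values, and toy stream are assumptions.

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on binary features."""

    def __init__(self, n_weights, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * n_weights   # accumulated (adjusted) gradients
        self.n = [0.0] * n_weights   # accumulated squared gradients

    def _weight(self, i):
        # Weights are derived lazily from z and n; L1 clamps small ones to exactly 0.
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict(self, active):
        wx = sum(self._weight(i) for i in active)
        return 1.0 / (1.0 + math.exp(-max(min(wx, 35.0), -35.0)))

    def update(self, active, p, y):
        g = p - y                    # log loss gradient for each active feature (x_i = 1)
        for i in active:
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g

# Toy stream: feature 0 always clicks, feature 1 never does, feature 2 is seen once.
model = FTRLProximal(n_weights=3)
stream = [([0], 1), ([1], 0)] * 100 + [([2], 1)]
for active, y in stream:
    p = model.predict(active)
    model.update(active, p, y)
```

In this toy run the feature seen only once keeps a weight of exactly zero, because its accumulated gradient never exceeds the L1 threshold. That is the automatic feature selection described above.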

With the clean data, a slight tuning, and 3 passes on the data, this model hit a log loss of 0.3925.

Neural Networks

I tried to use neural networks both in Vowpal Wabbit and in my own implementation. In Vowpal Wabbit the neural network has only one hidden layer with sigmoid activation. It didn’t show an improvement.

I created, in Python, a neural network with ReLU activation units. This type of activation has been widely used in deep learning because it mitigates the vanishing gradient problem and favors sparse data representations. In some cases the result is equivalent to, or better than, networks with stacked unsupervised layers.

I used a single hidden layer, and in this case there was an improvement on the validation data, but it did not translate into an improvement in the official competition score. Maybe using more than one hidden layer, and moving into deep learning territory, would have helped, but I did not have time to test it.

The best neural network, on the original clean data, with 30 neurons in the hidden layer and ReLU activation, reached a log loss of 0.3937.
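For reference, a network of this shape (binary inputs, one ReLU hidden layer, sigmoid output) fits in a few lines of plain Python. This is a sketch under assumed settings, not the implementation used in the competition; the initialization range, learning rate, and toy data are illustrative.

```python
import math
import random

class TinyReluNet:
    """One hidden ReLU layer and a sigmoid output, trained by SGD on log loss."""

    def __init__(self, n_inputs, n_hidden=30, learning_rate=0.05, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                   for _ in range(n_inputs)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
        self.b2 = 0.0
        self.lr = learning_rate

    def forward(self, active):
        # With binary inputs, the hidden pre-activation is a sum over active rows of w1.
        pre = [b + sum(self.w1[i][j] for i in active) for j, b in enumerate(self.b1)]
        hidden = [max(0.0, v) for v in pre]
        z = self.b2 + sum(w * h for w, h in zip(self.w2, hidden))
        return hidden, 1.0 / (1.0 + math.exp(-z))

    def train_step(self, active, y):
        hidden, p = self.forward(active)
        delta_out = p - y                     # gradient of log loss w.r.t. the output logit
        for j, h in enumerate(hidden):
            # ReLU passes the gradient only through units that were active.
            delta_h = delta_out * self.w2[j] if h > 0 else 0.0
            self.w2[j] -= self.lr * delta_out * h
            self.b1[j] -= self.lr * delta_h
            for i in active:
                self.w1[i][j] -= self.lr * delta_h
        self.b2 -= self.lr * delta_out
        return p

net = TinyReluNet(n_inputs=2)
for _ in range(500):
    net.train_step([0], 1)    # feature 0 goes with clicks
    net.train_step([1], 0)    # feature 1 goes with non-clicks
p_click = net.forward([0])[1]
p_no_click = net.forward([1])[1]
```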

Hashing trick

In a problem with many independent variables like this, storing weights for all of them in memory becomes an issue. Although this is a small dataset when we talk about big data, it already requires the use of the hashing trick.

This technique consists of hashing the feature values and assigning weights to the buckets, instead of directly to each column. In this case, because we only have binary variables, it is fairly easy to use.

After the hashing is done, at each iteration of training we update the weights of the buckets. There is a risk of collision, but the higher the number of buckets, the lower the risk.

In practice, collisions cause no significant loss of performance, so this has become a widely used technique in problems involving high dimensionality. Both Vowpal Wabbit and FTRL Proximal use this technique.
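A minimal sketch of the idea, using `zlib.crc32` as a stable hash. The hash function, the `column=value` key format, the bucket count, and the sample row are assumptions made for illustration.

```python
import zlib

def hash_features(row, n_buckets=2 ** 20):
    """Map each 'column=value' pair to a bucket index in [0, n_buckets)."""
    return [
        zlib.crc32(f"{column}={value}".encode("utf-8")) % n_buckets
        for column, value in sorted(row.items())
    ]

row = {"site_id": "85f751fd", "device_type": "1"}    # illustrative values
indices = hash_features(row)
```

The model then keeps one weight per bucket, so memory is fixed by `n_buckets` regardless of how many distinct values appear. Python's built-in `hash()` is randomized per process for strings unless `PYTHONHASHSEED` is set, which is why a checksum function is used here to keep the indices reproducible between runs.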


Ensembles

It is almost impossible to score well in a Kaggle competition without putting many models together in an ensemble. Basically, if you have models that have similar accuracy but different errors, there is a good chance that if you join their predictions, you will have better performance.
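A small made-up example shows the effect. The two hypothetical models below have the same log loss but are unsure about different examples, and their simple average scores better than either one.

```python
import math

def log_loss(y_true, y_pred):
    """Mean binary cross-entropy (predictions assumed strictly between 0 and 1)."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1.0 - p)
        for y, p in zip(y_true, y_pred)
    ) / len(y_true)

def average_predictions(prediction_lists):
    """Simple average of several models' predictions, example by example."""
    n_models = len(prediction_lists)
    return [sum(ps) / n_models for ps in zip(*prediction_lists)]

labels = [1, 0]
model_a = [0.9, 0.4]    # confident on the first example, unsure on the second
model_b = [0.6, 0.1]    # the opposite

blend = average_predictions([model_a, model_b])
loss_a = log_loss(labels, model_a)
loss_b = log_loss(labels, model_b)
loss_blend = log_loss(labels, blend)
```

Because log loss is convex in the prediction, the average of the predictions never scores worse than the average of the individual losses. Beating the best single model, as happens here, additionally requires that the models' errors differ.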

Models in partitions of the data based on independent variables

The first attempt at ensembling was to create individual models for certain partitions of the dataset. For example, create 24 datasets, each containing only the samples from a given hour of the day, or create a dataset for each application ID, and train individual models on them. Then subdivide the test set in the same way and get predictions from these individual models.

In some cases, if the partitions correspond to groups that really differ from one another, performance can improve greatly.

After creating several models based on subgroups that seemed to differ from one another, the final ensemble, taking the simple average of 8 models, reached a log loss of 0.3935.
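The partitioning scheme can be sketched generically: group the training rows by a key, fit one model per group, and route each test row to the model for its group. Everything here is illustrative. The helper names are hypothetical, the stand-in "model" is just a smoothed click rate, and the `hour` column is assumed to be a YYMMDDHH string.

```python
from collections import defaultdict

def train_partitioned(rows, labels, key_fn, train_fn):
    """Fit one model per partition; train_fn(rows, labels) returns a predict function."""
    groups = defaultdict(lambda: ([], []))
    for row, y in zip(rows, labels):
        xs, ys = groups[key_fn(row)]
        xs.append(row)
        ys.append(y)
    return {key: train_fn(xs, ys) for key, (xs, ys) in groups.items()}

def predict_partitioned(models, rows, key_fn, fallback):
    """Route each row to its partition's model, or to fallback for unseen partitions."""
    return [models.get(key_fn(row), fallback)(row) for row in rows]

def mean_rate_model(xs, ys):
    """Stand-in model: the Laplace-smoothed click rate of the partition."""
    rate = (sum(ys) + 1) / (len(ys) + 2)
    return lambda row: rate

def hour_of_day(row):
    return row["hour"][-2:]

train_rows = [{"hour": "14102100"}, {"hour": "14102100"}, {"hour": "14102101"}]
train_labels = [1, 1, 0]
models = train_partitioned(train_rows, train_labels, hour_of_day, mean_rate_model)

test_rows = [{"hour": "14102200"}, {"hour": "14102201"}, {"hour": "14102205"}]
preds = predict_partitioned(models, test_rows, hour_of_day, fallback=lambda row: 0.5)
```

A fallback matters in practice, since the test set can contain partitions (a new application ID, for instance) that never appeared in training.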

Models based on random interactions of the variables

Another way to create ensembles is to use different variables for each model. After checking that some interactions improved the score using FTRL Proximal, I decided to create a script that tested the effect an interaction would have on the ensemble score. Although this is interesting, one should be careful about overfitting.

Each candidate model used one interaction, and the script tested whether an ensemble of that model with models using other interactions improved performance. This attempt generated 5 linear models with different combinations of variables, which together, taking the simple average, reached a log loss of 0.3888.

Putting together the best models

In the end, I put together several models that showed a good score and reached a log loss of 0.3878, which secured 42nd place, among the top 3% of solutions.

The difference between the log loss of this solution and the winner was 0.0087.

Other ideas and the winning solution

After the end of a competition, it is common for the top-ranked participants to disclose the models and ideas that contributed to their performance.

In addition to the methods described here, two approaches stood out. The first was attributes that took temporal factors into account, such as the number of visits a user made to a site or application in the last hour, or the number of visits on the same day. The second was attributes based on counts of how often a value, or an interaction between variables, appeared. Knowing how to use the temporal characteristics of the data seems to have favored the winners.
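A feature such as “visits in the last hour” could be computed with a sliding window per user. This is only a sketch of the idea, not the winners' code; it assumes events sorted by time, timestamps in seconds, and some user identifier, all of which are hypothetical here.

```python
from collections import defaultdict, deque

def visits_in_last_hour(events, window_seconds=3600):
    """events: (timestamp_seconds, user_id) pairs sorted by time.

    Returns, for each event, how many earlier events by the same user
    fall inside the preceding window.
    """
    recent = defaultdict(deque)
    counts = []
    for ts, user in events:
        q = recent[user]
        while q and ts - q[0] >= window_seconds:
            q.popleft()                 # drop visits that left the window
        counts.append(len(q))           # count only visits before this one
        q.append(ts)
    return counts

events = [(0, "u1"), (600, "u1"), (900, "u2"), (3000, "u1"), (4000, "u1")]
counts = visits_in_last_hour(events)
```

Counting only earlier events keeps the feature free of information from the future, which matters for data with a strong time dependence.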
