Category: Kaggle

How to Win the Machine Learning Competition of Australia’s Largest Telecommunications Company in Just 19 Days

Telstra, Australia’s largest telecommunications and media company, sponsored a contest on Kaggle as part of a recruiting process, looking for data scientists who would stand out and could join their Big Data team.

Service log data from their network were made available, and the task was to create a model that could predict the severity of a failure at a particular location and time, using the data from these reports.

In total, 974 competitors from around the world participated, and this victory put me in 12th place (of 50,000+ users) in the global Kaggle ranking.

About the Data

There were about 7,400 training examples and 11,200 test examples. This is a small amount of data, which requires special attention during model building and validation.

Each example had an ID representing a location and time pair. In addition, other files were provided that contained, for each example: information on the signals obtained by the current severity monitoring system when processing the logs, the event type, and the severity of the messages the system sent at that time.

The severity of the problem had three categories: 0, meaning there were no problems; 1, meaning some problems were detected in the network; and 2, representing many reported problems. It was therefore a multi-class classification problem.


Although there is temporal and spatial dependence in the data, since they come from different locations at different points in time, the organizers split the data randomly. It was possible to know the location, but not the time, at which each record was obtained.

So simple cross-validation was the best environment I found to test my model. Cross-validation is the procedure of dividing your data into K parts and running K training cycles, each using K – 1 parts to train and the held-out part to evaluate.

In this case I used 5 folds, which provided a fairly reliable estimate of the performance on the unseen test data revealed after the end of the competition.
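The 5-fold procedure described above can be sketched with scikit-learn. The data here is a synthetic stand-in (the model and sizes are illustrative, not the actual competition setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold

# Synthetic stand-in for the training set (3 severity classes).
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           n_classes=3, random_state=42)

# 5 folds: each cycle trains on 4 parts and evaluates on the held-out part.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)

print(scores.mean())  # average accuracy across the 5 held-out folds
```

The mean of the 5 held-out scores is the performance estimate; the spread across folds also hints at how stable that estimate is.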

Feature Engineering

In total I had 242 variables.

“Magic” Variable

There was a pattern in the data, later dubbed the “magic variable” in the forums, that allowed competitors to exploit what seemed to be a mistake in data preparation and achieve a good improvement in model accuracy.

Although the training and test files had random IDs, the row order of the records in the other files had predictive value. You could use the row position in these files directly, or create variables derived from it, to exploit the leak.

Some people criticized the fact that the best solutions used this error to get a better score in the competition, but I think it is important to be able to find this type of error in data. The only thing that would change outside a competition is that the error should be corrected rather than exploited.

Therefore, it is always important to look for errors in the data. There are several types of information leaks, some very subtle, that can make your model look better than it actually is.


There was a variable indicating the location of each record. In theory this is a categorical variable, meaning it has no natural order, but I still used it as an ordinal instead of creating a column for each value, for two reasons: trees can capture the patterns of categorical variables even in this format (although in some cases one-hot encoding is better), and there could be some proximity pattern between sites.

For example, it may be that locations 201 and 202 were under the same weather conditions, which could influence the occurrence of problems at both stations, so perhaps there was an implicit cluster there.
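A minimal sketch of this ordinal encoding with pandas, assuming the location labels follow a hypothetical "location N" pattern: keeping the numeric part means nearby station numbers stay close in feature space.

```python
import pandas as pd

# Hypothetical labels in the style described in the text.
df = pd.DataFrame({"location": ["location 1", "location 201", "location 202"]})

# Treat the categorical location as an ordinal by extracting its numeric part,
# instead of one-hot encoding it into one column per value.
df["location_ord"] = (df["location"]
                      .str.extract(r"(\d+)", expand=False)
                      .astype(int))

print(df["location_ord"].tolist())  # [1, 201, 202]
```

A tree can then split on thresholds of `location_ord`, which is how it could pick up an implicit cluster like 201–202.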

Log Variables

The service logs had what seemed to be a line for each warning signal, with a value indicating how many times the warning was triggered at that time. The organizers did not make it clear whether this was the correct interpretation, but it is the most plausible one.

With this, I created a column for each of these signals, and the value for each example was the number of warnings.

These columns were quite sparse (more zeros than other values), so most of them ended up being useless, but some had predictive value.
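The reshaping from one-row-per-signal to one-column-per-signal can be sketched with a pandas pivot. The column names and values here are invented for illustration:

```python
import pandas as pd

# Hypothetical log format: one row per (example id, warning signal, count).
logs = pd.DataFrame({
    "id":     [1, 1, 2, 2, 3],
    "signal": ["s1", "s2", "s1", "s3", "s2"],
    "count":  [3, 1, 2, 5, 4],
})

# One column per signal; the value is the number of warnings for that example.
# Signals absent from an example become 0, which is why the result is sparse.
wide = logs.pivot_table(index="id", columns="signal", values="count",
                        fill_value=0)
print(wide)
```

After this, `wide` can be joined to the training table on the example ID.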

Final Solution

After being satisfied with the performance of my best individual model, I decided to create an ensemble: several models that capture different patterns in the data and complement each other.

Among the 15 models that composed the final solution, two stand out: Gradient Boosted Trees (in the implementation of XGBoost) and Neural Networks (using Keras).

Gradient Boosted Trees – XGBoost

Anyone who is used to Kaggle knows that most of the competition solutions involve Boosted Trees. This model is quite powerful by itself, but the parallel implementation in the XGBoost package has proven to be very good at solving machine learning tasks in general.

In this case, my best “individual” model was one of these and would have finished in 18th place. There was still a lot that could be improved on it, but since I had only a week to go, I knew that without an ensemble it would be a lot harder to win, so I decided to end the single model optimization at this point.

Neural Networks

A model that finds good solutions, but using a different strategy than decision trees, is a neural network. Although it did not demonstrate a performance comparable to XGBoost, it was a great addition to it in the ensemble.

I used the Keras library, which is fantastic for creating neural networks in Python.

I tried normalizing the data with all of scikit-learn’s transformers, and the best was StandardScaler, which subtracts the mean and divides by the standard deviation. In addition, I used two layers of neurons and the Adam optimizer. To find a good architecture, dropout rate, and hyperparameters in general, I ran a random search.
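The actual network was built in Keras, but the scale-then-search pattern can be sketched with scikit-learn’s MLPClassifier as a stand-in (the architecture grid and parameter values are illustrative assumptions, not the ones used in the competition):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# StandardScaler first, then a two-hidden-layer network; RandomizedSearchCV
# samples architectures and regularization strengths at random.
pipe = make_pipeline(StandardScaler(),
                     MLPClassifier(max_iter=200, random_state=0))
search = RandomizedSearchCV(
    pipe,
    {"mlpclassifier__hidden_layer_sizes": [(32, 16), (64, 32), (128, 64)],
     "mlpclassifier__alpha": [1e-4, 1e-3, 1e-2]},
    n_iter=4, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Putting the scaler inside the pipeline matters: it is refit on each training fold, so the validation folds never leak into the scaling statistics.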

Neural networks seem pretty sensitive to “everything,” so I believe there are different combinations of architecture, data normalization, and variable encoding that would give equal or better results.

One of the reasons it did not do better was that I did not create a specific feature set for the neural network; I used the same variables as the trees. If this had been my main model, I would have done it differently, but since it was just one component of the ensemble, this was enough.


This was a brief summary of 3 weeks of work in this competition.

My initial plan was to create just one model and see how far I could go, but seeing that the model was good, I decided to try to reach my first individual victory.

Using Machine Learning to Predict the First Destination of 60,000 AirBnB Users

AirBnB is a technology company that offers a virtual environment where users can book accommodations to stay in various places in the world, and also advertise their properties to the travelers’ community.

Looking for candidates for their team of data scientists, they decided to sponsor a competition in which the goal was to predict the first country where a new user will make a reservation.

For me, the most interesting part of this competition was that it was possible to reach a good position (Top 7%) with a single model that could be put into production right away.

About the Data

Anonymized data about the user profile were made available, such as age, gender, account creation date, language, and the channel used to reach the site. The data went back to 2010.

In addition, there were data about user sessions, identified by user ID, describing the action performed (a click, a message to the property owner, viewing search results, etc.) and how many seconds passed between that action and the previous one. These data existed only from January 2014 onward.

There were two other files, with data on the characteristics of the countries, but I found no use for them.

The metric used was NDCG (Normalized Discounted Cumulative Gain). It ranges from 0 to 1 and measures the relevance of the model’s results when sorted by their predicted scores.
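A minimal NDCG@k implementation, using one common formulation (gain divided by a log2 position discount); with the binary relevance used here, this coincides with the variant that uses 2^rel − 1 as the gain:

```python
import math

def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k: DCG of the predicted ranking divided by the ideal DCG."""
    dcg = sum(rel / math.log2(i + 2)
              for i, rel in enumerate(ranked_relevances[:k]))
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# In this competition relevance was binary: 1 for the true country, 0 otherwise.
print(ndcg_at_k([1, 0, 0, 0, 0]))  # true country ranked first -> 1.0
print(ndcg_at_k([0, 1, 0, 0, 0]))  # ranked second -> 1/log2(3)
```

This is why ranking the true country second still earns most of the credit, and why the metric rewards getting a plausible ordering rather than a single hard guess.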

A user could choose between 12 countries, but almost 90% of them went to the United States or did not make a reservation.

It soon became clear, given the metric used, that it would be better to focus on modeling whether or not the user makes a reservation, treating the destination as secondary.


The data provided for training and validation referred to users who created their accounts before July 1, 2014, and the test data on the leaderboard covered the three months after this date.

Several participants used cross-validation with randomly shuffled examples, but because of the time needed to run the model during development, and because the data characteristics depended on time, I decided to use the May and June 2014 data for validation.

Some characteristics of the data, such as the proportion of users who had session data, changed over time, so I validated on a period quite close to the test set, and this period proved quite robust.


User Profile

I used the standard user profile variables already described above. Since I planned to use a model based on decision trees, I encoded each categorical variable as an ordinal, as these models are able to find patterns even though the variables have no real order.

In addition, I computed the amount of time a user spent between sessions, and how many different types of actions they performed on the site.


I extracted basic date information, creating individual columns for the day, month, year, and day of the week. This helps the model capture seasonal effects of the time series.
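This date decomposition is a one-liner per component in pandas (the dates below are invented):

```python
import pandas as pd

# Hypothetical account-creation dates.
df = pd.DataFrame({"created": pd.to_datetime(
    ["2014-01-15", "2014-06-30", "2010-12-05"])})

# One column per date component; tree models can then pick up seasonal effects.
df["day"] = df["created"].dt.day
df["month"] = df["created"].dt.month
df["year"] = df["created"].dt.year
df["weekday"] = df["created"].dt.dayofweek  # Monday=0

print(df[["day", "month", "year", "weekday"]])
```

Splitting a timestamp into parts lets a tree split on "month >= 6" or "weekday == 5", which it cannot do on a raw datetime.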

User Behavior

This is an essential part of most e-commerce models. In this specific case, I created a column with the amount of time the user spent on each action performed on the site, and used the same procedure to count how many times the user executed each action.

Machine Learning Model

I used a Random Forest with 300 trees that was good enough to win a place in the Top 7% (91 of 1463).

In total, I had about 1,100 columns, most of them quite sparse (most values were zero). Some people say that tree-based models do not do well with this data format, but my experience suggests it is a matter of finding the right parameters.

Unfortunately, I did not have time to train a good Gradient Boosted Trees model, which would certainly have performed better; given the small difference in scores between the Random Forest and the top of the leaderboard, it would almost certainly have been enough to finish in the Top 10.

Classifying Customer Visits to Walmart in 37 Categories Using Machine Learning

Defining a customer’s intent when visiting a store, whether real or virtual, can help a company deliver a personalized experience. That’s why it’s important to use tools that can help categorize and identify these visits.

Walmart has provided anonymous trip data for its customers to some of its stores. Among the data were the items purchased and their respective quantities, as well as descriptive information about the products and the day of the week. The task was to use this information to categorize the intent of that client’s visit.

Although they did not give us a reference table for each code, the organizers said that some examples of visit categories are: weekly grocery shopping, gifts for a celebration, or buying clothes for the new season.


There were about 95,000 trips for training, and 95,000 for testing. The file contained basically one line for each product purchased during a visit, and it was the competitor’s responsibility to properly transform it to merge the data for each trip.

The following information defined the products: the UPC code, with more than 100,000 different types; the Fineline code, a refined category created by Walmart, with about 5,000 types; and the Department, with about 60 different types.

In addition, the ScanCount variable defined how many products of that type were purchased, and if it was a negative number, the customer was returning products.

Transforming the Variables

In order to train a model it was necessary, at a minimum, to group the basic information by trip. Several reports on the forum described transformations resulting in thousands of variables.

As I had only 7 days to work on this competition, my goal was to create a simple and compact model, with as few variables as possible, but with performance good enough to place me in the top 10%.

Basic Variables

Examples of variables based on statistics of each trip are: the average amount of each product purchased during the trip, indication if there was a return of product, sum of the quantity of all products purchased and the department with the highest quantity of items purchased.

Counts and Proportions

I decided to sum the quantity of products purchased per department, and to use both the count for each department and the proportion of the purchase that each department represented. This created about 120 variables.

Singular Value Decomposition + TF-IDF

To try to use Fineline and UPC without increasing the number of variables, I decided to sum the quantity of products in each of them and apply two transformations.

First, TF-IDF, which replaces the raw quantity with a weight based on the proportion of an item within that trip, discounted by how often the item appears in other trips.

Then I applied the SVD, which tries to find the components that have the most variation in the data.

These transformations are usually used with text, applied to the word count in each document, and known as Latent Semantic Analysis in that context.

In addition to reducing the size of the data, it is expected to remove much of the noise, and find structural categories to which the trip belongs.

In practice this helped a lot with Fineline, but it did not help much with UPC.
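The TF-IDF + SVD pipeline can be sketched with scikit-learn on a synthetic trip-by-Fineline count matrix (the sizes and random counts are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

# Hypothetical trip x Fineline count matrix (rows: trips, cols: Fineline codes).
rng = np.random.default_rng(0)
counts = rng.poisson(0.3, size=(50, 200))

# TF-IDF reweights the raw counts, then TruncatedSVD keeps the directions of
# largest variation, compressing 200 sparse columns into 10 dense ones.
lsa = make_pipeline(TfidfTransformer(),
                    TruncatedSVD(n_components=10, random_state=0))
compact = lsa.fit_transform(counts)
print(compact.shape)  # (50, 10)
```

The 10 dense components then join the other trip-level variables, keeping the model compact.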

Logistic Regression L1 and L2

Another way to reduce size is to train a simpler model and use its predictions as a variable in the main model.

For this, I trained two logistic regressions: one with L2 penalty and another with L1 penalty. This generated about 70 variables, 37 for each regression, with the probability of an example belonging to each class.
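Generating such meta-features safely requires out-of-fold predictions, which `cross_val_predict` provides. A sketch with both penalties (the solver choice and data are illustrative assumptions; `saga` is used here because it supports L1 and L2 on multi-class problems):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=400, n_features=20, n_classes=3,
                           n_informative=6, random_state=0)

# Out-of-fold class probabilities from an L2- and an L1-penalized regression
# become extra columns for the main model, avoiding target leakage.
meta = []
for penalty in ["l2", "l1"]:
    clf = LogisticRegression(penalty=penalty, solver="saga", max_iter=2000)
    meta.append(cross_val_predict(clf, X, y, cv=5, method="predict_proba"))

features = np.hstack(meta)  # one probability column per class per regression
print(features.shape)  # (400, 6)
```

With the competition's 37 classes this yields 74 columns, matching the "about 70 variables, 37 for each regression" above.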


Gradient Boosted Trees – XGBoost

Most of my time was focused on building a good model with XGBoost. This model was already good enough to stay in the top 10%.

Neural Networks

To complement XGBoost, and try a better position, I decided to train a neural network on the same variables. It had 2 hidden layers and dropout.

Other competitors reported good results with neural networks, but I did not spend much time tinkering with them. The intention was only to achieve a slight improvement over the result of XGBoost.

Possible Results and Improvements

The final solution was a simple ensemble between a neural network and XGBoost, which was enough to secure a position among the top 7%. The solution took about 8 hours of work.

Using only the Gradient Boosted Trees model, without the ensemble, it was possible to stay in the Top 9%. This model had only 200 variables, while most of the better models had more than 5,000. Increasing the number of variables could certainly push this model above the Top 5%, but it would take much longer to train.

Other possible improvements would be: to increase the ensemble, to create a set of variables optimized for the neural network, to use directly the variables with Fineline counts, to better tune the parameters of the logistic regressions and the transformations.

How We Won the Caterpillar Machine Learning Competition at Kaggle

Caterpillar is a manufacturer of industrial equipment such as tractors and engines. To maintain their operations, they need to buy tubes of various specifications, from several different suppliers, to use in their production line. Each supplier and product has a different pricing model.

The task in this competition was to create a model that would be able to price the tubes using historical supplier data and product characteristics.

I had the pleasure of competing on the team that won this competition, surpassing more than 1300 teams of data scientists from around the world. In this article I want to describe the most important parts about creating the solution that secured this victory.

Since I am most familiar with my own part of the solution, a good part of this article refers to it, but I want to make it clear that the winning solution was created from different models developed individually by each member. None of us would have managed to win alone, and everyone was essential for the team to reach first place.


Several files with characteristics of the tubes and the suppliers were available. There are two basic categories of pricing: fixed price and variable price according to quantity.

There were about 30,000 rows for training and 30,000 for testing. Something important to consider is that several rows referred to the same tube, differing only in the minimum quantity that had to be purchased to get that price.

Among the available variables were: quantity, tube ID, date of purchase, annual use forecast, diameter and length.


This is one of the most important parts of any data mining task. It is essential to create a validation environment that has the same or very close characteristics of the production environment. In our case, the test data.

This means that it would not be enough to simply shuffle the rows and distribute them into folds as in “standard” cross-validation. Instead, I created folds based on the tube IDs: rather than distributing rows randomly, each fold contained all the rows of a given tube.

Although it was a time series, training and test data were not split between past and future, so taking the date into account to define the validation data did not bring benefits.

This validation scheme proved to be quite robust and close to both the public leaderboard and the evaluation in the test data (private leaderboard).

Feature Engineering

After a robust validation environment, the most important part is finding data characteristics that can be transformed into variables to feed the model.

I will present transformations that, in general, can be used with other data, so they are not specific to this task.

Ordinal Representation of Categorical Variables

The most popular Python implementations of models based on decision trees treat all variables as numeric. In this case there are some alternatives, such as one-hot encoding, so that the model can find unique characteristics of each level of a categorical variable.

But by the very nature of the model, encoding the variables ordinally still allows decision trees to approximate the value of each category as if it were encoded in the correct way.

In this case we have a categorical column in which each level is represented by a number.

Record Counts

The number of times a record appears in the data can also have predictive value.

A common, and quite useful, transformation is to replace the categorical values with the count of records that belong to that level. In this case it brought a significant improvement in the error.
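This frequency encoding is one line in pandas (the column name and values below are hypothetical):

```python
import pandas as pd

# Hypothetical categorical column (e.g. a supplier code).
df = pd.DataFrame({"supplier": ["S1", "S2", "S1", "S3", "S1", "S2"]})

# Replace each level by the number of records belonging to it.
df["supplier_count"] = df["supplier"].map(df["supplier"].value_counts())

print(df["supplier_count"].tolist())  # [3, 2, 3, 1, 3, 2]
```

In production, the counts should be learned on the training set only and mapped onto new data, so unseen levels get a sensible default rather than a leaked count.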

Minimum and Maximum Quantity Available

Exploiting the fact that about 90% of the tubes had prices that varied with the purchased quantity, I noticed that the more expensive tubes had a lower maximum purchase quantity than the cheaper ones.

Most of the more expensive tubes reached their lowest price when buying 5 units, while the cheaper tubes went up to 250 units.

This attribute was very important, and demonstrates the value that exists in trying to understand the data and the task in question.

Information Leaks in IDs

This is the situation where we have information in the training data that allows us to discover patterns in the dependent variable, but which should not exist in reality.

In this case, each tube was identified by a number, and when sorting the data using this variable, I realized that there were small tube clusters with similar prices and IDs. This means that the ordering of the tubes had a predictive value.

To use this information, I simply created a variable with the numeric part of the IDs.

Both in competitions and in practice, it is important to look for these patterns. In the first case, this can help you achieve a better result. In the second case, all information leakage must be removed so as not to compromise the performance of the model in a production environment.

Weight of Components

Each tube had different components that would be attached to it. The characteristics of each component varied, but one of them stood out, the weight.

With access to the quantity and type of components used in each tube, I calculated the total weight of each tube’s components. This variable had a significant correlation with the price and a small correlation with the other features.

Machine Learning Models

Gradient Boosted Trees

Famous for its good performance in competitions across different areas, this was my best model. I used the XGBoost implementation, which is one of the best open-source versions of this model I know.

XGBoost has the option to optimize RMSE, but since we were optimizing RMSLE in this competition, i.e. the logarithmic error, I transformed the dependent variable with the function log(y + 1). There were suggestions to use a root of the variable instead of the log, and my best model ended up being trained on the variable transformed with the 15th root.
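The transform and its inverse are a pair of numpy calls; the prices below are invented for illustration:

```python
import numpy as np

y = np.array([12.5, 40.0, 3.2])  # hypothetical tube prices

# Train on log(y + 1) so that minimizing RMSE on the transformed target
# approximates minimizing RMSLE on the original prices.
y_log = np.log1p(y)
y_back = np.expm1(y_log)  # invert after predicting

# The root alternative mentioned above would be y ** (1.0 / 15),
# inverted with pred ** 15.
print(y_log)
```

`log1p`/`expm1` are preferred over `log(y + 1)`/`exp(p) - 1` written out, as they stay numerically accurate for values near zero.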

This model, alone, was enough to get the 10th place.

Regularized Greedy Forests

This was the novelty of this competition. Although I had tested this model in the Facebook competition, I had never dedicated time to making it work.

The algorithm basically does the following: it optimizes the same loss function as Gradient Boosted Trees, but with a regularization term, and it periodically readjusts the weights of each terminal node of the trees in the ensemble.

In theory, it’s an improvement over Gradient Boosted Trees, but in this competition it did not present a better performance. I believe that with more time to adjust the parameters and train it is possible to achieve a better result.

Other Models

The team trained two other types of models to compose the ensemble, but they alone did not perform well:

– Neural networks
– Factorization Machines

Ensemble (Stacking)

The ensemble was done in three levels, using models of all the members of the team.

At the first level, the models received the original data and made out-of-fold predictions using the same cross-validation scheme.

At the second level, four models used predictions created at the previous level as training data, and again made predictions using a cross-validation scheme to avoid overfitting.

At the third level, the mean of the predictions of the second level models was used as the final forecast.
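The three levels above can be sketched with `cross_val_predict`, which produces the out-of-fold predictions each level trains on (the base models and data here are illustrative stand-ins, not the team's actual models):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# Level 1: base models make out-of-fold predictions on the original data.
level1 = [RandomForestRegressor(n_estimators=50, random_state=0),
          GradientBoostingRegressor(random_state=0)]
p1 = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in level1])

# Level 2: models trained on level-1 predictions, again out-of-fold
# to avoid overfitting to the base models' mistakes.
level2 = [Ridge(), GradientBoostingRegressor(random_state=1)]
p2 = np.column_stack([cross_val_predict(m, p1, y, cv=5) for m in level2])

# Level 3: simple mean of the level-2 predictions as the final forecast.
final = p2.mean(axis=1)
print(final.shape)  # (300,)
```

The key detail is that every level only ever sees predictions made on data the upstream model did not train on; feeding in-sample predictions forward would overfit immediately.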


In this competition, it was extremely unlikely to win without forming a good team. The use of solutions created by different members, which eventually improved the predictions in the ensemble, was essential.

In addition, one of the most important factors in winning a Kaggle competition is time. The more time you have to test ideas, the greater the chances of success.

Determining Product Relevance in a Search Engine With Machine Learning

One of the most important tasks for an e-commerce site is to ensure that users find the products they are looking for. This includes a search engine that returns relevant products. In this competition, CrowdFlower, a specialized platform for data collection, made product search data available, and the median relevance score given by three people for each pair of product and search term.

The goal was to create a model that could calculate the relevance of a product to a given search phrase.


Approximately 10,000 training records were available, and 20,000 were separated for testing. In these records there was the search term, title and description of the product, the median of relevance and the variance of the relevance attributed by the users.

The test samples included a portion of “fake” records that were not used to calculate performance; they were placed there to discourage hand-labeling.

Evaluation Metric

The evaluation metric was the Quadratic Weighted Kappa. It measures the disagreement between two “raters”; in this case, between the score given by the model and the score given by humans. It is sensitive not only to accuracy but also to the distribution of the scores (labels). A random solution results in a Kappa of 0, and total agreement in a Kappa of 1, but in some cases Kappa can be negative.
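scikit-learn exposes this metric as `cohen_kappa_score` with quadratic weights; the labels below are invented to show how near misses are penalized less than distant ones:

```python
from sklearn.metrics import cohen_kappa_score

truth = [1, 2, 3, 4, 4, 2, 3, 1]
pred_good = [1, 2, 3, 4, 3, 2, 3, 1]  # one near miss (4 predicted as 3)

# Quadratic weighting penalizes a prediction more the further it is from
# the true label, so predicting 3 for a 4 costs far less than predicting 1.
print(cohen_kappa_score(truth, truth, weights="quadratic"))      # 1.0
print(cohen_kappa_score(truth, pred_good, weights="quadratic"))  # < 1.0
```

This distance sensitivity is also why the post-processing tricks described later in this section paid off.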

Data Processing


The basic transformation of title and description is to make a bag of words. In its simplest form it is a matrix whose rows are the examples (documents) and the columns are words (terms). The value of each cell is the number of times the word appears in that document.

A more advanced transformation is to take this matrix and do some operations using the frequency of terms in each document, and also in the document body. This is called TF-IDF. In this case, the most frequent terms within a specific document receive a greater weight, but that is discounted by a factor inversely proportional to its frequency in other documents.

Put simply: Rare words that appear a lot in a document are given more weight, and frequent words that appear in all documents become less important.

Beyond that, it is important to set a minimum number of times a word must appear in the corpus before it becomes a column, in order to reduce noise.

The best representation I found was to train a matrix based only on titles, and another only in the description. Some participants suggested training in both at the same time, but since this is a search problem, I believe it is important to differentiate them.

SVD – Latent Semantic Analysis

As you might expect, the matrix of documents is huge, and it’s pretty sparse. This makes it difficult to use with some non-linear models. In this case there is the option of reducing dimensionality using SVD. This linear algebra algorithm basically finds the main components of the document matrix, where there is greater variation. This reduces the dimensionality of the problem, but can cause loss of information.

In natural language processing these components end up working well, because they filter out some of the noise in the data. Applying SVD to the TF-IDF matrix gives a representation known as Latent Semantic Analysis. The interpretation is that each component obtained refers to a “concept” represented in the documents. It is also widely used to visualize documents in a low-dimensional space.

In the competition I used SVD to reduce the dimensionality of the TF-IDF matrix before feeding it to the models.

Other Attributes

In addition to the above features, I created others that seemed to make sense, and some that I found in related academic papers. The most important of these was the percentage of search words that were in the title. In addition, I created attributes based on the average similarity between a product and the others of the same search term, as well as the count of times the search term appears in the data.
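The most important feature, the percentage of search words present in the title, is simple enough to write in plain Python (the function name and tokenization are illustrative; real text would need more careful normalization):

```python
def pct_query_in_title(query, title):
    """Fraction of (lowercased) search words that appear in the title."""
    q = query.lower().split()
    t = set(title.lower().split())
    return sum(w in t for w in q) / len(q) if q else 0.0

print(pct_query_in_title("led tv", "Samsung 40in LED TV"))        # 1.0
print(pct_query_in_title("led tv stand", "Samsung 40in LED TV"))  # 2/3
```

A value near 1 means the title covers the whole query, which intuitively correlates with the human relevance labels.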

An important part was coding each search term with one hot. That is, for each of them I created a column that had the number 1 if it was present, and the number 0 otherwise. I did this to help the model capture the predictive value inside each search term.

Regression or Classification?

There were two ways to look at this task: consider each level of the median as a class, making it a classification problem, or treat it as a regression problem and round the predictions.

At first I thought it was more appropriate to treat it as a regression, but since Kaggle wanted the rounded predictions, I ended up creating several classification models and did not have time to test the other alternative.

With the classification we have two options: use the class provided by the model, or use the probabilities of each class to make a weighted average to get the score. The first alternative proved to be better in this case.
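The two options amount to an argmax versus an expected value over the class probabilities (the probability vector below is invented):

```python
import numpy as np

classes = np.array([1, 2, 3, 4])
proba = np.array([0.05, 0.10, 0.60, 0.25])  # hypothetical model output

hard = classes[proba.argmax()]  # option 1: the predicted class
soft = (classes * proba).sum()  # option 2: probability-weighted score

print(hard)  # 3
print(soft)  # 3.05 (would still need rounding before submission)
```

The weighted average carries more information but must be rounded back to a valid label, which is where it can lose to the plain class prediction, as happened here.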


Support Vector Machine

The best models were obtained using SVM. It is one of the most used models for text classification, so it is no surprise that it performs well.

The best individual model, using the SVD components and the other attributes, reached a Kappa of 0.672 in the cross-validation, and 0.654 in the LB.

Random Forest

One of my favorite models, but not usually the best choice for sparse data in high dimensions. Anyway, after several tests, I got 0.6695 in cross-validation, and 0.6508 in LB using a TF-IDF matrix with the other attributes.

One reason that may help explain why it behaves better with the TF-IDF matrix than with the SVD matrix is that Random Forest is a variance-reduction method, so it remains robust even with many attributes, as long as they are relevant to the prediction.

Gradient Boosted Decision Trees

Although this model is also not very common for use with sparse data in high dimensions, I decided to train several “weak” models and stack using their predictions and the other attributes, which did not involve a representation of the words.

The stacked models had an average of 0.60 Kappa in validation.

The score for this model was around 0.624 on LB. But the real intention with it was to have a different model to ensemble.

Post Processing

In all of the Top 10 solutions I read, there was one detail in common: post-processing predictions. Teams used special methods for rounding, multiplying by coefficients, or determining different cut-off points for predicting each “class”.

In one case, a team sorted its predictions and rounded them according to the proportion of each class in the training data. Kappa is a metric sensitive to the distribution of predictions, so this ended up helping.

The closest I came to adjusting the predictions was using the geometric mean instead of the simple average for the ensemble. The models tended to overestimate the relevance of the products, and the geometric mean helped bias the predictions toward the lower classes.
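The downward bias of the geometric mean follows from the AM–GM inequality: it never exceeds the arithmetic mean, and a single low vote pulls it down sharply. A small numpy illustration with invented per-class probabilities:

```python
import numpy as np

# Hypothetical per-class probabilities from three models for one example.
preds = np.array([[0.2, 0.5, 0.3],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.4, 0.3]])

arith = preds.mean(axis=0)
geo = np.exp(np.log(preds).mean(axis=0))  # geometric mean across models

# The geometric mean is pulled down by any low vote, biasing predictions
# toward lower scores unless all models agree.
print(arith.round(3), geo.round(3))
```

In practice, the probabilities should be renormalized to sum to 1 after the geometric mean if a proper distribution is needed.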

Ensemble and Overfitting

In the end, I took a geometric mean of my best models. The ensemble consisted of several SVMs, a Random Forest, and a Gradient Boosted Decision Trees model.

Although it is not good practice, especially with as little data as in this competition, I used the LB to validate the ensemble. That led me to choose my second-best ensemble, yet I still finished in the Top 5% of the competition.

How I Got 6th Place Among 985 Participants in the Facebook Machine Learning Competition

In this recruiting competition, Facebook provided anonymized data about users of an auction site. The task was to create a model that determined which users were human and which were robots. Human users of an auction site do not want to compete with machines, and a site with many robots can suffer an exodus of human users, so it is important to have a robust system for detecting suspicious users.

According to the administrators, this data did not come from Facebook, but I think the interest was to create an environment similar to their advertising auction system.

This competition became quite special for me, since it was my first Top 10 on Kaggle and made me a Kaggle Master. That was my first goal, but it also came with the added bonus of being considered for an interview for a Data Scientist position at Facebook.


One of the most interesting parts of this competition, which made it very close to a real case of data science, was the fact that we did not have a raw file with variables related to the objective that could be used directly with some model of machine learning.

Two files were available: the first contained about 2,000 rows of anonymized user information, such as ID and address. The second contained more than 7 million bids made on the site.

Variables in the second file included: Bid ID, User ID, Auction ID, auction category, user device, time, country, IP, and URL. The time was coded, so it was not possible to find out the exact date of the bids.


Having a robust validation environment is essential for any machine learning task. In this case the challenge was having only 2,000 examples to train and validate on. After testing several combinations, I decided to use stratified cross-validation with 5 folds, repeated 5 times.

The repetition causes the examples to be randomly placed in different folds each time, which helps stabilize the error estimate, since any particular example has less influence on the mean.
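This repeated scheme can be set up directly in scikit-learn; here is a sketch on synthetic data standing in for the real training set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Small, imbalanced synthetic stand-in for the labeled users.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# 5 stratified folds, repeated 5 times: each repetition reshuffles the
# examples into different folds, stabilizing the mean AUC estimate.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, scoring="roc_auc", cv=cv,
)

print(len(scores), scores.mean())  # 25 fold scores in total
```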

Feature Engineering

The most important part of any data science job is making sure you have the right data to make the prediction. A very common mistake is to think that you can just feed the data to the algorithm and it will learn whatever you want. This was by far the most important part of my solution.


The first step was to create variables with the count of how many bids each user made. Extending this idea, I created variables for counting how many different devices, IPs, categories, countries, and URLs the same user had used. With these simple variables, a Random Forest, without tuning, reached 0.86 AUC in cross-validation with 5 divisions.

In the case of countries, in addition to integrating the variable itself to count the number of different countries, I made individual variables for each of them, counting how many bids the user made coming from each country. This helps the model identify “risky countries,” where it may be easier to host robots to bid.
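A minimal pandas sketch of these count features, on a tiny hypothetical bid log:

```python
import pandas as pd

# Hypothetical bid log in the shape described above.
bids = pd.DataFrame({
    "bidder_id": ["a", "a", "a", "b", "b"],
    "device":    ["d1", "d2", "d1", "d1", "d1"],
    "country":   ["ru", "ru", "us", "us", "us"],
    "ip":        ["1", "2", "3", "4", "4"],
})

# Total bids plus the number of distinct devices, countries and IPs per user.
feats = bids.groupby("bidder_id").agg(
    n_bids=("device", "size"),
    n_devices=("device", "nunique"),
    n_countries=("country", "nunique"),
    n_ips=("ip", "nunique"),
)

# One column per country: how many bids the user made from it.
per_country = pd.crosstab(bids["bidder_id"], bids["country"])
feats = feats.join(per_country)
print(feats)
```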

Simple Statistics

After seeing that the counts had great predictive value, I decided to test simple statistics, such as averages and deviations. Examples of important variables:

– Average and standard deviation of bids per auction;
– Maximum number of bids given in the same auction, with the same timestamp;
– Average time between bids.


The variable that indicated the time in the original data was coded so that you could not know the exact date that the bids occurred. But after analyzing the data, I realized that even though I did not know the exact date, it was possible to identify what would be days, hours and minutes.

This allowed me to create variables related to averages, deviations, proportions and counts based on units of time. Some important variables:

– Standard deviation of bids per day;
– Average bids per day;
– Average bids on the same timestamp;
– Time-of-day bid count;
– Average auctions that the user participated in the same day.
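A sketch of the per-day statistics, assuming the coded timestamps have already been bucketed into hypothetical "day" units:

```python
import pandas as pd

# Hypothetical bids already assigned to decoded "day" buckets.
bids = pd.DataFrame({
    "bidder_id": ["a"] * 6 + ["b"] * 2,
    "day": [1, 1, 1, 2, 2, 3, 1, 3],
})

# Number of bids each user made on each day.
per_day = bids.groupby(["bidder_id", "day"]).size()

# Average and standard deviation of bids per day, per user.
stats = per_day.groupby(level="bidder_id").agg(["mean", "std"])
print(stats)
```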

It seems that all the participants who finished at the top of the LB were able to identify these time patterns.


Some proportions based on counts and time were also important. Some examples are:

– Maximum proportion of total bids in a day assigned to the user;
– Density of bids per hour of day


When you have variables that contain enough information about the event, model selection and parameter tuning become simpler. In this case, despite briefly testing linear models, I chose to explore models based on decision tree ensembles more thoroughly.

Random Forest

The first model I tested, and the one that proved itself most useful in this case.

I believe its superior performance came from being a model geared toward variance reduction. Because we had little data to train on, it was important that the model could cope with extreme values and produce stable predictions.

A Random Forest with 500 trees and parameters adjusted by cross-validation had 0.9112 AUC in the local measurement, and 0.9203 in the LB.

Gradient Boosted Trees

This is one of the most used models in data science competitions because it performs well on many types of tasks. Not here, though: my best models scored around 0.90-0.91 AUC on the LB despite reaching 0.94 in validation.

This raises a question: we have a decision tree-based model that gives us good performance (Random Forest) and, in theory, boosting should improve the score. Why is that not the case here?

My answer is: Random Forest increases the bias, that is, it makes the model more rigid, with less variance, while GBT increases the variance, being more sensitive to the noise present in the data. Here we have fewer than 2,000 examples and a strong imbalance between classes. Combined, these two factors produce data with a good amount of variance, and increasing it further causes overfitting.

The solution would be tuning and regularizing the GBT, but with so little data to validate on, it is hard to trust the results. So I decided to rely on the theoretical reasoning and keep working with Random Forest to have a more stable model.


The final touch of most of the best competition solutions is to make an ensemble with the best models.

Random Ensemble

Due to the small number of examples, it was difficult to trust the parameters found through validation. So I decided to create an ensemble of Random Forests with randomized parameters: each one had a different seed and slightly different parameters. This ensemble achieved an AUC of 0.92 both in the out-of-bag estimate and on the LB.
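A sketch of this kind of randomized ensemble (the parameter ranges here are illustrative, not the ones actually used):

```python
import random

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)
random.seed(42)

# Train several forests, each with its own seed and randomly drawn
# hyperparameters, then average their predicted probabilities.
preds = []
for seed in range(5):
    rf = RandomForestClassifier(
        n_estimators=100,
        max_features=random.choice(["sqrt", "log2", None]),
        min_samples_leaf=random.randint(1, 4),
        random_state=seed,
    ).fit(X, y)
    preds.append(rf.predict_proba(X)[:, 1])

ensemble = np.mean(preds, axis=0)
print(ensemble[:5])
```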

AdaBoost Random Forest

Despite the instability of the GBT, I decided to apply boosting to Random Forest. Perhaps applying this method to a base learner more stable than a single decision tree could help, even with the small number of examples. And that is exactly what happened.

In the validation it obtained 0.925 of AUC, and 0.928 in LB.
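In scikit-learn, boosting over Random Forests can be sketched by passing a forest as AdaBoost's base estimator (synthetic data; parameters illustrative, not the competition setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced synthetic stand-in for the competition data.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# Each boosting round fits a small Random Forest on reweighted examples;
# the base estimator is passed positionally for compatibility across versions.
model = AdaBoostClassifier(
    RandomForestClassifier(n_estimators=30, random_state=0),
    n_estimators=5,
    random_state=0,
)
auc = cross_val_score(model, X, y, scoring="roc_auc", cv=3).mean()
print(auc)
```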

There was still a difference between the validation score and the LB when I added new variables.

Stabilizing the Seeds

To stabilize the predictions, I ran the Boosted Random Forest model with several different seeds and averaged the results. This did not improve the LB score, but it was stable in cross-validation, which made me trust the model.


In the end, I used a Boosted Random Forest model with a set of attributes that reached 0.94 AUC in cross-validation. That was essentially my score on the competition's final test set, which gave me 6th place.

Classifying 150 Thousand Products in Different Categories Using Machine Learning

The Otto Group is one of the largest e-commerce companies in the world.
According to them, due to the diversity of the company’s global infrastructure, many similar products are classified into incorrect categories. So they made available data on approximately 200 thousand products, belonging to nine categories. The objective was to create a probabilistic model that correctly classified the products within each category.


For training, about 60 thousand products were available and, for testing, about 150 thousand.

The 93 features were not identified; the only information given was that they were numeric. This makes it harder to find the best way to work with the data, and also complicates the creation of new features.

In addition, the task involves multiple classes, that is, we have to generate probabilities for 9 classes.


The metric chosen for this competition was the multi-class log loss.

This metric heavily punishes confident probabilities on wrong classes, and is sensitive to class imbalance. Some classes represented 20% of the training data, others 9%. In this case, trying to balance the classes, whether by subsampling or penalizing the smaller ones, will not help.

Feature Engineering

Even with anonymous variables, I decided to explore interactions between them that might not be captured by the models.


I tried to create new features based on sums, differences, ratios, and products of the originals. The only feature that contributed significantly was the sum of all attributes for a given product. Most of the attributes were quite sparse (more zeros than other values), which may explain the lack of success.

Using a GBM / RF to Select Features

As we are talking about a gigantic space of possibilities of combinations of attributes, I decided to use a technique used by a participant of another competition and published in this link: http://trevorstephens.com/post/98233123324/armchair-particle-physicist

Basically it consists of creating some datasets with the desired interactions and training a model of Gradient Boosted Trees in each one. After that, we check which are the most important features in each dataset. The logic is: if a feature is important in several different datasets, it should be important overall.

I tried it, using 2 stratified splits, but although the models agreed on the importance of some interactions, these did not improve my base model. Since I had already spent days on this part, I decided to focus on fitting models for an ensemble.

If I had continued, perhaps selecting the X best interactions and re-adjusting the hyperparameters of some model could extract value from these variables. Or cause terrible overfitting.


Gradient Boosted Trees (XGBoost)

My individual model with the lowest log loss was built with XGBoost, a fast, parallel implementation of the powerful Gradient Boosted Trees. This model has appeared in winning solutions for many competitions and usually delivers superior performance.

To find the best parameter set I used a random search. I fixed the learning rate at 0.05 and varied parameters such as tree depth, the minimum number of examples per node, and the proportion of examples and variables the algorithm randomly selects to build each tree.

Usually the lower the learning rate, the better the accuracy, but more trees are needed, which increases training time. In the end, I left this value at 0.01 and trained 2000 trees.

These parameters control overfitting. After finding the best combination, the cross-validation log loss with 3 splits was 0.4656, and on the leaderboard, 0.4368.
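The search itself can be sketched with scikit-learn's random search. For a self-contained example I use `GradientBoostingClassifier`; the same idea applies to XGBoost's `max_depth`, `min_child_weight`, `subsample`, and `colsample_bytree`:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic multi-class stand-in for the product data.
X, y = make_classification(n_samples=400, n_classes=3, n_informative=6,
                           random_state=0)

# Fix the learning rate and randomly sample the structural parameters.
search = RandomizedSearchCV(
    GradientBoostingClassifier(learning_rate=0.05, n_estimators=50,
                               random_state=0),
    param_distributions={
        "max_depth": randint(2, 8),
        "min_samples_leaf": randint(1, 20),
        "subsample": uniform(0.5, 0.5),     # fraction of examples per tree
        "max_features": uniform(0.3, 0.7),  # fraction of variables per split
    },
    n_iter=5,
    scoring="neg_log_loss",
    cv=3,
    random_state=0,
).fit(X, y)

print(search.best_params_, -search.best_score_)
```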

Random Forests

I did not have initial success with Random Forests in this competition, but a suggestion in the forum was to calibrate the predictions. The scikit-learn team had recently released a tool that adjusts a model's output values so that they become closer to true probabilities.

A Random Forest usually makes a vote among its components, and the proportion of each class is given as the probability of the example belonging to that class. These proportions do not necessarily match the actual probabilities of the events, so we will only have true probabilities if we calibrate.

As this competition asked us to predict the likelihood of a product belonging to a category, it was very important to have the probabilities adjusted correctly. In this case, I used the new scikit-learn tool with 5 data splits.
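A sketch of that calibration step with `CalibratedClassifierCV` and 5 splits (synthetic data; parameters illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Uncalibrated forest: class vote proportions used as probabilities.
raw = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Same forest wrapped in scikit-learn's calibration tool with 5 internal
# splits, mapping vote proportions to something closer to true probabilities.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="isotonic", cv=5,
).fit(X_tr, y_tr)

ll_raw = log_loss(y_te, raw.predict_proba(X_te))
ll_cal = log_loss(y_te, calibrated.predict_proba(X_te))
print(ll_raw, ll_cal)
```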

This greatly improved the Random Forest's predictions, and although it also uses decision trees, making it similar to the GBM, it ultimately contributed significantly to the ensemble.

Neural Networks

I had the opportunity to learn two great modules to train neural networks in a simple way in Python. They are Lasagne and NoLearn. With them, it is simple to create and train state-of-the-art neural networks, including using GPUs for processing.

But make no mistake: although they make implementation easier, a neural network requires many decisions to become useful. We have to determine the architecture, the weight initialization, and the regularization methods. Some researchers suggest doing a random search over initial parameters and then continuing to adjust manually.

In this case, the architecture that served me well was the following: 3 hidden layers with 768, 512, and 256 units, respectively. I trained two versions: one with a dropout of 0.5 between the hidden layers, and another with a dropout of 0.5 only after the first hidden layer. The hidden units were ReLUs, and the output layer was a softmax.

An interesting interpretation of another competitor who came to a similar architecture is that the first hidden layer allows the network to make random projections of the data, and the smaller layers that come after it, seek a synthesis of the information. Neural networks are naturally difficult to interpret, but I found this a very interesting point of view.

Finally, since the solution of a neural network is a local minimum, and dependent on the initial weights (and we have many initial weights), I decided to average 5 neural networks with different initial weights. This average gave me a score of 0.4390 on LB, comparable to XGBoost.
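The seed-averaging idea can be sketched as follows, with scikit-learn's `MLPClassifier` standing in for the original Lasagne/NoLearn networks (architecture and data here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)

# Train networks that differ only in their random initial weights,
# then average the predicted probabilities.
preds = []
for seed in range(5):
    net = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=300,
                        random_state=seed).fit(X, y)
    preds.append(net.predict_proba(X))

avg = np.mean(preds, axis=0)
print(avg.shape)
```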


I also tried training an SVM, Nearest Neighbors, and Logistic Regression, but none of them performed well or contributed significantly to the ensemble.

Adjusting the Hyperparameters

Since we did not have access to the meaning of the variables, it was vital to tune the hyperparameters to get the best models. At first I did a random search over reasonable values that still allowed the models to explore the space a bit.

Usually a model's hyperparameters are not independent, so adjusting one at a time will not find the best combination; it is important to search over combinations of them. Most of the time it is impossible to test every combination, but a random search gives us an idea of the solution space without exploring it exhaustively.

Once this exploration is done, it is good to vary some of the parameters manually and see whether a neighboring combination improves performance.


In the end my solution was a weighted average of the following models:

– Gradient Boosted Trees (XGBoost) on the original variables and the sum of the columns.
– Random Forest on the original variables, log(X + 1) of them, and the sum of the columns.
– Neural network with 3 layers and dropout of 0.5 in each, on the original variables, log(X + 1) of them, and the sum of the columns.
– Neural network with 3 layers, with dropout of 0.5 in the first hidden layer and none in the second, on the original variables, log(X + 1) of them, and the sum of the columns.

In addition to these models, I added a bias model: simply predicting each class's training proportion as the probability of an example belonging to it.
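A sketch of the final blend, with illustrative weights and probabilities (not the actual values used):

```python
import numpy as np

# Hypothetical probabilities from two models for 4 examples, 3 classes.
p1 = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1],
               [0.3, 0.3, 0.4], [0.6, 0.2, 0.2]])
p2 = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1],
               [0.2, 0.5, 0.3], [0.5, 0.3, 0.2]])

# The "bias" model always predicts the training class proportions.
class_proportions = np.array([0.5, 0.3, 0.2])
bias = np.tile(class_proportions, (4, 1))

# Weighted average; weights must sum to 1 so rows stay valid probabilities.
weights = [0.5, 0.4, 0.1]
blend = weights[0] * p1 + weights[1] * p2 + weights[2] * bias
print(blend)
```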

This solution reached a log loss of 0.4080 and guaranteed me 35th position among 3,514 teams (Top 1%).

Winning solutions

The winning team’s solution consisted of a three-layered ensemble:

In the first layer 33 different models were trained, varying both the type of model and the variables used to train each one. Some were multiple models trained with bagging.

In the second layer, the predictions of these models were used to feed an XGBoost, a neural network, and an ExtraTrees with Adaboost.

In the third, and last, layer, they bagged the three models of the second layer (totaling 1100 models) and then did a weighted average among them.

In addition, new variables were created, based on the distance of the example to the closest examples of each class, as well as others based on the TF-IDF and T-SNE transformations of the original dataset.

Using Machine Learning to Identify Drivers From GPS Data

In recent years the insurance industry has been looking for ways to improve its models using Machine Learning. One is to use data that goes beyond a form completed by the insured to determine the risk of accidents.

One of these methods is to use driver behavioral data obtained through GPS tracking. It is believed that this can capture information and profile patterns that go beyond traditional methods.

What makes a driver different from the other? Identifying the driver who is driving a vehicle during a trip is the first step in getting a good model.

Competition Description

The insurance company AXA decided to provide anonymized trips from approximately 2,700 drivers. For each driver there were 200 trips, some of which were not from the driver in question, and the task was to create a model that identified which trips had been incorrectly attributed to that driver.

To avoid information leakage, the data was centered and randomly rotated, and some portions of the beginning and end of each trip were removed.


In total, 200 trips were made available for each of 2,736 drivers. Within each driver's folder was a random number of trips falsely attributed to that driver. One important piece of information was the assurance that most trips truly belonged to the driver.

Each trip was described in a CSV file with numeric values for the x and y positions, representing the distance from the origin in meters, with each line corresponding to one second of displacement.


The metric used for evaluation was the area under the ROC curve, abbreviated as AUC.


The features I used are based primarily on the first three derivatives of the displacement: speed, acceleration, and jerk. Other derivatives undermined performance.

Frequencies – FFT

I tried to extract frequencies from both the raw data and its derivatives, using the FFT (Fast Fourier Transform) function available in Numpy. I divided the frequency range into 20 buckets, and the value of each feature was the sum of the frequencies in its bucket.
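A minimal sketch of these bucketed FFT features with Numpy (random noise stands in for a trip's speed series):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(600)  # e.g. one trip's speed series, one value per second

# Magnitudes of the FFT, folded into 20 buckets; each feature is the
# sum of the magnitudes falling in its bucket.
spectrum = np.abs(np.fft.rfft(x))
buckets = np.array_split(spectrum, 20)
features = np.array([b.sum() for b in buckets])
print(features.shape)
```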

In this case I extracted the frequencies separately for each component (x and y). Perhaps computing the frequencies of speed and acceleration with the two components combined would have been better, but I did not test this hypothesis.

The maximum AUC I achieved, using frequencies of the three derivatives, was 0.7058, with a logistic regression without parameter optimization.


Histograms

This is the transformation I found in every academic paper I read on the subject. It consists of building a histogram of an attribute, which introduces a new parameter: the number of intervals.

The best setup I found was to divide the magnitude of the velocity into 50 intervals, and the acceleration into 10.

Two details improved performance: using the "density" argument of the Numpy function, which computes the probability density function, and normalizing the results so that they sum to 1.
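A sketch of these histogram features with Numpy (the speed values are synthetic):

```python
import numpy as np

rng = np.random.RandomState(0)
speed = np.abs(rng.randn(600)) * 10  # hypothetical speed magnitudes for one trip

# 50-bin histogram of the speed magnitude, using density=True and then
# normalizing so the features sum to 1.
hist, _ = np.histogram(speed, bins=50, density=True)
hist = hist / hist.sum()
print(hist.shape)
```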


Percentiles

Another frequent suggestion was to use the attribute values at certain percentiles. This helps capture how extreme the values are.

An example may help: in the case of velocity, to get the value at the 5th percentile, I sorted the values in increasing order and located the point marking the border of the first 5% of the sorted data.

If we had a list of 100 sorted velocity values indexed from 1 to 100, we would take the value at position 5. The value at position 50, for example, would be the median.

This is important because it differentiates drivers who have a faster or slower profile.

I used the 5th to 95th percentiles, with intervals of 5.
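A sketch of the percentile features (synthetic speed values):

```python
import numpy as np

rng = np.random.RandomState(0)
speed = np.abs(rng.randn(600)) * 10  # hypothetical speed values for one trip

# Values at the 5th through 95th percentiles, in steps of 5.
percentiles = np.arange(5, 100, 5)
features = np.percentile(speed, percentiles)
print(features)
```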

Repeated Trips

One approach that helped carry the best teams to the top of the leaderboard was an algorithm to find similar trips. It was notable that some stretches, although rotated at a different angle relative to the origin, were identical.

I created a simple algorithm based on the driver's speed during the trip but, despite finding some similar trips, it ended up hurting the score.

The algorithms that succeeded were more complex, in some cases involving variable transformations, dimensionality reduction, and thresholds to determine similarity. Of all of them, two methods seemed essential:

– Ramer-Douglas-Peucker: because each trip had a different duration, this method was used to reduce the number of points.
– Dynamic Time Warping: this method was used to compare parts of trips and determine the similarity between them.


Since it was not possible to know for certain which trips belonged to each driver (some trips marked positive were actually negative), there was an opportunity to decide the best way to sample the data.


At first, with unsupervised methods, I used only each driver's 200 trips. A suggestion in the competition forum was to use a driver's 200 trips as positive examples and trips from other drivers as negative examples.

Overall this was a very good approach. Among the ways of doing it were:

– Randomly pick N drivers and treat all of their trips as negative.
– Randomly choose V trips from N different drivers to use as negative.

The approach that worked best for me was the second option: selecting a small number of trips from several different drivers.

Similar / different samples

Another alternative I tested was selecting negative samples from drivers that were more similar to, or more different from, the positive samples, according to a measure such as Euclidean distance.

I created attributes such as the average speed, distance traveled, and duration of trips for each driver and determined which were the most similar, and which were the most different of the driver in question.

Although the approach using the "most different" drivers showed improvement in cross-validation, it did not improve the score on the test set. The reason, I believe, is that the validation data contains trips that are false but marked as true, whereas in the test set we have the absolute truth.

Reinforcing predictions

I tried to reinforce the model predictions by training more than once, but now using the predictions from the previous model instead of the original dependent variable, as follows:

In the first pass I did the normal procedure, using the driver's trips as positives and randomly selected trips from other drivers as negatives, to obtain predictions.

In the following passes I used the classes predicted by the previous model as the dependent variable. This improved cross-validation greatly, but showed only a very small improvement on Kaggle's validation data.

Models and Methods


At first I decided to cluster the trips according to similar characteristics. The idea was based on the assumption that, since most trips belonged to the driver, it would be possible to separate them into two clusters, with the smaller one being the "negative" trips.

Another idea was to increase the number of clusters, since there were differences between trips even when they belonged to the same driver.

Either way, these ideas proved useless, with AUC close to 0.5.

Another unsupervised approach I tried was computing the cosine similarity between the average of the trips' attributes and each individual trip. In this case the "probability" of a trip belonging to the driver would be one minus this value. This also proved useless.

Logistic Regression

My most successful model was logistic regression. This surprised me, since the success described by several participants in the forum involved boosted trees.

I used about 1,000 examples: 200 positive and 800 negative. An important detail was enabling the option "class_weight = auto". This option assigns a different penalty to the examples according to the class distribution.

In this case, an error or a correct prediction on the positive class was worth more than the same on the negative class. This way, although we had more negative examples, the penalty for errors was balanced.

I trained a regression for each driver, personalizing the regularization penalty (L1 or L2) and its coefficient. Each driver model performed a search for the values for this coefficient using the two types of regularization, and at the end selected the type of regularization and the coefficient that presented the best AUC in cross-validation.
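A sketch of this per-driver search with scikit-learn (note that the `class_weight="auto"` of older versions is now called `"balanced"`; the data and grid values here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in for one driver's data: roughly 200 positives, 800 negatives.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

# For each driver, search over the regularization type (L1 or L2) and its
# strength, keeping whichever combination gives the best cross-validated AUC.
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", solver="liblinear"),
    param_grid={"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",
    cv=5,
).fit(X, y)

print(search.best_params_, search.best_score_)
```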

The best individual model reached the AUC of 0.86.

Random Forests / Boosted Trees

According to the forums, the most used model was Gradient Boosted Trees. Basically, it consists of training a sequence of decision trees, each giving more weight to the examples the previous trees got wrong. The probability of an example is a combination of the predictions of all the trees.

In my case, attempts to use this model were not successful. Two reasons may have contributed:

– For lack of confidence in my validation, I did not pay much attention to optimizing its parameters to achieve a good AUC.

– The cross-validation needed to find good parameters would take a long time, and preliminary tests had not produced good results.

Anyway, I trained a Random Forest model with default parameters to use in an ensemble. It achieved an AUC of 0.847.


Towards the end of the competition I decided to take the "easy" way to improve the score: ensembles.

For a part of the ensemble I trained the following models:

– Logistic regression with L1 regularization only
– Logistic regression with L2 regularization only
– Linear SVM
– Random Forests

Although the SVM did not output probabilities, I thought it would make a good contribution to the ensemble, which proved to be true.

The simple mean of the predictions of these models produced an AUC of 0.8825.

Finally, I combined these with everything else I had produced since the start of the competition, for a final AUC of 0.8915. This guaranteed me a place among the top 15% of solutions.

Methods of Winning Solutions

The winning solutions used 400 examples for each driver, selecting random trips from other drivers as negative examples. The most used model was Gradient Boosted Trees.

In addition, complex algorithms that found similar trips and assigned them directly to the driver squeezed out the extra points needed to rank among the top 10 solutions.

Using Machine Learning to Discover Illegal Ads in Russian, While Not Knowing Russian

The site Avito.Ru, founded in 2007, is one of the 5 largest Russian classifieds websites. It is used by both individuals and businesses to negotiate new and used products. One of the biggest challenges is maintaining the quality of content in the face of growth and the volume of new ads. To resolve this, the site decided to seek a solution based on the available data.

This was a competition run on the Kaggle platform.

For readers with technical knowledge, the final solution code is available at this link (github).

Available Data

The available database contained the content of the ads: attributes of the advertised product, category and sub-category, title, and description. The variable to predict was binary: whether or not the ad would be blocked by moderation. About 4 million ads with labels from human moderation were made available for training, and 1.5 million for testing.

In addition, there was a variable indicating whether the ad had been blocked by an experienced moderator, which could make a difference given the human error factor.

Evaluation Metric

The metric used to evaluate the solutions was AvgPrecision@K (average precision at K). After assigning a score indicating the likelihood that a particular ad was breaking the rules, the ads had to be ordered so that the most likely offenders ranked highest. The system then took the top K ads and compared them against the true labels to compute the model's hit rate.

Data Transformation

My solution involved only the title and description of the ads. First the basics: lowercase all words, remove stopwords, and stem. One detail is that the ads were in Russian, so I do not know exactly which stopwords have other meanings (in English, "can" is a stopword, but it is also used as a noun). Either way, these procedures improved the model's performance.

I then turned the documents into a numeric matrix where each row was an ad and each column a word. For the element values I tested three variations:

– Binary: where the presence of the word in the ad was indicated with the number 1;
– Count: each value was the number of times the word appeared in the ad;
– Tf-Idf: each value was based on a formula that takes into account the frequency of the word in the ad, and also relative to other ads. It assigns a higher value to words that are rare overall but frequent within the specific ad.

Among these alternatives, the one that demonstrated the best performance was the Tf-Idf. This is a technique widely used for text classification, and generally shows improvement in classification accuracy over the other options.
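The three representations can be produced with scikit-learn's vectorizers (the ads below are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny hypothetical Russian ads standing in for the real data.
docs = ["продам телефон новый", "куплю телефон", "продам ноутбук"]

# Binary / count / tf-idf versions of the document-term matrix.
binary = CountVectorizer(binary=True).fit_transform(docs)
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

print(binary.shape, counts.shape, tfidf.shape)
```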

I made some attempts to clean the data further, such as removing the rarest words, but performance suffered. What helped a little was removing numbers.


My focus was on a simple solution, so my first attempt was Logistic Regression on all the data. One factor to take into account was the distribution of positive and negative examples, which was far from 50/50. With this solution the score was 91.8%.

After testing a few options available in scikit-learn (the Python machine learning library I used), I found that the "modified_huber" loss function created a more accurate model. This loss is more robust than log loss, since it applies a quadratic penalty to small errors and a linear one to large errors.

Another idea that helped a lot was separating the ads by category. Each category had a different proportion of positive examples (some under 1%, others over 30%). Applying the above algorithm to the data split this way reached a 97.3% score, an improvement of 10.6%.

For the final solution, I also trained a Naive Bayes model which, despite assuming the variables are independent, performs well for text classification. Averaging the predictions of the two algorithms gave a final score of 97.9%.
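A sketch of that final blend, on a tiny hypothetical corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

docs = ["buy phone cheap", "sell laptop", "cheap watches now", "selling my bike"]
y = np.array([1, 0, 1, 0])  # hypothetical blocked / not-blocked labels

X = TfidfVectorizer().fit_transform(docs)

# modified_huber exposes predict_proba, and Naive Bayes works well on
# text despite its independence assumption.
sgd = SGDClassifier(loss="modified_huber", random_state=0).fit(X, y)
nb = MultinomialNB().fit(X, y)

# Average the two models' predicted probabilities of the positive class.
blend = (sgd.predict_proba(X)[:, 1] + nb.predict_proba(X)[:, 1]) / 2
print(blend)
```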

Differences to the winning solution

Comparing my solution with the winning team's, there is a small difference of 0.8% in the score. But looking at complexity, the winning solution used more than 30 models, between transformations and classification, for each example of the training set. In practice, as is usually the case in these competitions, it would not be worth implementing the winning solution. This takes no credit away from the winners, nor from the opportunity to learn from them.

Using Machine Learning To Predict Which User Will Click An Ad

Avazu is an advertising platform that delivers ads on websites and mobile applications. One of the biggest challenges in the industry is determining which ad is most relevant to the user. If your ad is in accordance with the user’s interests, the chance of click is higher, and this increases the profits of both the platform and the advertiser, which will have a potential customer visiting his site.

Machine learning has been used by several companies that deliver ads in a pay-per-click format, that is, companies paid by the advertiser for each click on the ad. Among the applications is the creation of models to estimate the probability of a user clicking on an ad, based on:

– Profile information: location, device used;
– Advertiser information: ad category, size, and color;

And also data about the site or app in which the ad is displayed.

In this repository I make available the original implementation of FTRL Proximal by Kaggle user tinrtgu, based on the paper published by Google, as well as my modified version used to obtain my best single model.

Data Description

In this competition, Avazu made available a dataset with 10 days of impression and click logs, with information about users, advertisers, and display platforms. The goal was to predict the probability of a visitor, in an app or on a site, clicking on the displayed ad.

The day following the training period was kept as the test set to evaluate the model. The training set had about 40 million rows, and the test set about 4 million. Furthermore, it was a high-dimensional problem, with more than 1 million independent variables.

The metric chosen for evaluation was LogLoss, which heavily punishes inaccurate predictions that are very confident. We should keep in mind that there was an imbalance between the proportion of positive and negative examples, and this metric favors correctness in the most represented class (in this case, the negative ones).

Baseline model

Predicting a probability of 50% for all examples gave a log loss of 0.6932. The first model I tested was a logistic regression in Vowpal Wabbit, without regularization and with only one pass over the data. It resulted in a log loss of 0.3993.
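As a sanity check, the metric and the constant-prediction baseline can be sketched in a few lines (this helper is illustrative, not the competition’s scoring code):

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Average negative log-likelihood; confident wrong predictions
    are punished far more heavily than uncertain ones."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Predicting 50% for everything gives ln(2) ~ 0.6931, matching the baseline above.
labels = [0, 0, 1, 0]
print(log_loss(labels, [0.5] * len(labels)))  # ~0.6931
print(log_loss([1], [0.01]))                  # ~4.605: one confident mistake
```

Note how a single confident mistake costs more than six “I don’t know” predictions combined, which is exactly why calibrated probabilities matter under this metric.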

Cleaning the Data

Due to the large number of features and values, some of them didn’t occur frequently in the dataset, and this can hurt the classification. Logistic regression is sensitive to outliers in the features, so removing them could improve performance.

I tested two strategies: removing the outliers, or grouping them under a “rare” variable. The choice that significantly improved log loss was to group the values of features that were present in fewer than 5 examples.
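A minimal sketch of this grouping step (the column layout and the “rare” token are my own; the actual competition code differed):

```python
from collections import Counter

def group_rare_values(rows, min_count=5, rare_token="rare"):
    """Replace categorical values seen in fewer than `min_count` rows
    with a shared 'rare' token, one column at a time."""
    n_cols = len(rows[0])
    cleaned = [list(r) for r in rows]
    for col in range(n_cols):
        counts = Counter(r[col] for r in rows)
        for r in cleaned:
            if counts[r[col]] < min_count:
                r[col] = rare_token
    return cleaned

rows = [["a"], ["a"], ["a"], ["a"], ["a"], ["b"]]
print(group_rare_values(rows))  # "b" appears only once -> becomes "rare"
```

Grouping instead of dropping keeps the row count intact while pooling the noisy tail of each feature into one bucket the model can learn a single weight for.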

The same logistic regression, without any optimization, applied to the “clean” dataset reached a log loss of 0.3954.

Validation Issues

In this competition one of the big problems was finding a suitable way to create a reliable validation set.

These data change rapidly: new variables come in, and old variables are no longer used in a matter of days or weeks, so it is natural that there is this difficulty in reaching a consensus on the exact performance of the model in the future.

Some ways were discussed in the competition forum, and two of them were widely used:

Progressive Holdout

In this scheme, after each N examples used to train the model, the log loss was calculated on example N + 1. After going through the entire dataset, the average of these scores was calculated. In my attempts this proved to be a rather optimistic method.
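The idea can be sketched as a loop that scores each example before training on it; the running-mean “model” below is mine, purely to make the loop runnable:

```python
import math

def progressive_logloss(stream, model):
    """Progressive (online) validation: score each example with the
    current model *before* using it for training, then average."""
    total, n = 0.0, 0
    for x, y in stream:
        p = min(max(model.predict(x), 1e-15), 1 - 1e-15)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        n += 1
        model.fit_one(x, y)  # the model only learns from the example after scoring it
    return total / n

class RunningMean:
    """Toy model: predicts the Laplace-smoothed click rate seen so far."""
    def __init__(self):
        self.clicks, self.n = 0, 0
    def predict(self, x):
        return (self.clicks + 1) / (self.n + 2)
    def fit_one(self, x, y):
        self.clicks += y
        self.n += 1

stream = [(None, 0), (None, 1), (None, 0), (None, 0)]
print(progressive_logloss(stream, RunningMean()))
```

Because every score is computed on an example the model has not yet seen, no separate holdout is wasted; the downside, as noted above, is that the estimate can be optimistic when the data drifts over time.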

Validation in the last days

Because this data has a time dependence, the model was trained excluding the last day, or the last two days, of the dataset, which were then used for validation. This alternative proved more reliable than the other, but it was not very robust to the choice of interactions between variables, so it was still necessary to watch for overfitting in those cases.

Feature Engineering

All features of this competition were binary, and some were anonymous, so I had no information about what they represented.

I tried counting how many times each value of a variable appeared in the dataset and using this count as a new feature. It did not work for me and worsened performance, although it did work for other teams.

An alternative that showed improvement, although nothing significant, was to create a variable indicating whether a given value had more or fewer than 100 occurrences.

Interaction between variables

Another alternative was to create interactions between two or more variables. This was done simply by creating a variable that indicated the presence of combinations between values. At first I tried the interaction between all the pairs of variables, which worsened performance.

I made another attempt, this time manually creating combinations that seemed relevant to me (representing a user’s access to an app, for example). These eventually improved the model, reducing the log loss to 0.3889.
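One simple way to build such hand-picked interactions is to concatenate the values of the chosen columns into a new categorical feature (the column names below are illustrative, based on the kinds of fields Avazu provided):

```python
def add_interactions(row, pairs):
    """Append a combined feature for each hand-picked column pair,
    e.g. (device, app) to capture 'this user on this app'."""
    extra = {}
    for a, b in pairs:
        extra[f"{a}_x_{b}"] = f"{row[a]}_{row[b]}"
    return {**row, **extra}

row = {"device_id": "d1", "app_id": "app7", "hour": "09"}
print(add_interactions(row, [("device_id", "app_id")]))
# adds 'device_id_x_app_id': 'd1_app7' alongside the original fields
```

Each combined value then gets its own weight in the linear model, which is how a logistic regression can capture effects that only appear for specific pairs.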

Models Trained

Logistic Regression with Vowpal Wabbit

I took this opportunity to learn a new tool: Vowpal Wabbit, a fast implementation of machine learning algorithms that minimize convex functions, for both regression and classification. Just to keep in mind: the “pure” logistic regression, without data cleaning, had a log loss of 0.3993. After cleaning, this number dropped to 0.3954.

Calibrated SVM with Vowpal Wabbit

Since it was necessary to send a list of probabilities to Kaggle, I tried using the distance of the data to the SVM hyperplane as input, both to a logistic regression (inspired by Platt scaling) and to an isotonic regression, available in scikit-learn.

These are two popular ways to calibrate an SVM so that it outputs probabilities. Neither showed a good improvement in the score.
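A minimal stand-in for the Platt-scaling idea: fit a one-variable logistic regression on held-out SVM margins by gradient descent. In practice scikit-learn’s LogisticRegression or IsotonicRegression would do this job; the sketch below only illustrates the mapping from margins to probabilities:

```python
import math

def platt_scale(scores, labels, lr=0.1, epochs=500):
    """Fit p(y=1 | s) = sigmoid(a*s + b) on held-out SVM margins
    by plain gradient descent on the log loss."""
    a, b = 1.0, 0.0
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1 / (1 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s  # d(logloss)/da for one example
            grad_b += (p - y)
        a -= lr * grad_a / len(scores)
        b -= lr * grad_b / len(scores)
    return lambda s: 1 / (1 + math.exp(-(a * s + b)))

# Margins far on the positive side of the hyperplane map to high probabilities.
calib = platt_scale([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
print(calib(2.0), calib(-2.0))
```

The fitted sigmoid squashes raw margins into [0, 1] monotonically, which is all the calibration step is meant to do.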

Follow the Regularized Leader (FTRL-Proximal)

Google Paper – Ad Click Prediction

This was the algorithm that helped me greatly improve the log loss. It was implemented by one of the competitors and made available in the competition forum. Developed by Google for the same task of calculating ad click probability based on user information, it creates a sparser representation of the data and ends up being more robust to outliers. In the paper the authors describe the implementation and its characteristics compared to other algorithms used for the same task.

It can be said that it is basically a logistic regression with adjustments so that it does not need as much memory to store the weights, plus an optimization method that forces the least significant weights to take an absolute value of zero. That is, in a problem with millions of variables like this one, it disregards those that are not really important and, in this way, automatically selects features.

With the clean data, a slight tuning, and 3 passes on the data, this model hit a log loss of 0.3925.
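The per-coordinate update can be sketched as follows. This is a simplified reading of the paper’s pseudocode (and of the forum implementation); the hyperparameters and feature names are illustrative, not the ones I actually used:

```python
import math

class FTRLProximal:
    """Minimal per-coordinate FTRL-Proximal logistic regression."""
    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z, self.n = {}, {}  # per-feature accumulators

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 zeroes out weak coordinates -> sparse model
        n = self.n.get(i, 0.0)
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, features):
        s = sum(self._weight(i) for i in features)
        return 1 / (1 + math.exp(-max(min(s, 35.0), -35.0)))

    def fit_one(self, features, y):
        p = self.predict(features)
        g = p - y  # log-loss gradient for each active binary feature
        for i in features:
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g

model = FTRLProximal()
for _ in range(100):
    model.fit_one(["user=a", "app=x"], 1)
    model.fit_one(["user=b", "app=y"], 0)
print(model.predict(["user=a", "app=x"]))  # drifts above 0.5
```

The dictionaries only hold accumulators for features that actually occur, and the `l1` threshold keeps most implied weights at exactly zero, which is where both the memory savings and the automatic feature selection come from.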

Neural Networks

I tried to use neural networks both in Vowpal Wabbit and in my own implementation. In Vowpal Wabbit the neural network has only one hidden layer with sigmoid activation. It did not show an improvement.

I created, in Python, a neural network with ReLU activation units. This type of activation has been widely used in Deep Learning because it lessens the vanishing-gradient problem and favors sparse data representations. In some cases the result is equivalent to, or better than, networks with stacked unsupervised layers.

I used only one hidden layer, and in this case there was an improvement on the validation data, but it did not translate into an improvement in the official competition score. Maybe using more than one hidden layer, and getting into Deep Learning territory, would have helped, but I did not have time to test it.

The best neural network, in the original clean data, with 30 neurons in the hidden layer, and ReLu activation, reached the log loss of 0.3937.
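For reference, the forward pass of such a network — one hidden ReLU layer feeding a sigmoid output — looks like this. The weights below are random placeholders; the real model was of course trained:

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def forward(x, w1, b1, w2, b2):
    """One hidden ReLU layer, sigmoid output -> click probability."""
    h = relu([sum(wi * xi for wi, xi in zip(row, x)) + b
              for row, b in zip(w1, b1)])
    s = sum(wi * hi for wi, hi in zip(w2, h)) + b2
    return 1 / (1 + math.exp(-s))

random.seed(0)
n_in, n_hidden = 4, 30  # 30 hidden neurons, as in the model above
w1 = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [random.uniform(-0.1, 0.1) for _ in range(n_hidden)]
b2 = 0.0
print(forward([1, 0, 1, 0], w1, b1, w2, b2))  # a probability in (0, 1)
```

Because ReLU outputs exact zeros for inactive units, the hidden representation stays sparse, which is the property mentioned above.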

Hashing trick

In a problem with many independent variables like this, storing weights for all of them in memory becomes an issue. Although this is a small dataset when we talk about big data, it already requires the use of the hashing trick.

This technique consists of hashing the feature values and assigning weights to buckets, instead of directly to each column. Because we only have binary variables here, it is fairly easy to use.

After the hashing is done, at each iteration of training we update the weights of the buckets. There is a risk of collision, but the higher the number of buckets, the lower the risk.

In practice there is no significant loss of performance from collisions, so this has become a widely used technique in problems involving high dimensionality. Both Vowpal Wabbit and FTRL Proximal use it.
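The trick itself fits in a few lines (the `zlib.crc32` hash and the bucket count are my choices here; Vowpal Wabbit and the FTRL script use their own hash functions):

```python
import zlib

def hash_features(raw_features, n_buckets=2 ** 20):
    """Map each 'column=value' string to one of n_buckets weight slots.
    Collisions are possible but rare when n_buckets is large."""
    return [zlib.crc32(f.encode()) % n_buckets for f in raw_features]

def predict(weights, bucket_indices):
    """Dot product for binary features: just sum the bucket weights."""
    return sum(weights[i] for i in bucket_indices)

n_buckets = 2 ** 20
weights = [0.0] * n_buckets  # fixed memory, regardless of vocabulary size
idx = hash_features(["site_id=abc123", "device_type=1"], n_buckets)
print(idx)  # two bucket indices in [0, 2**20)
```

The key point is that memory is bounded by `n_buckets`, not by the number of distinct feature values, so new values appearing over time never grow the model.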


Ensembling

It is almost impossible to score well in a Kaggle competition without combining several models in an ensemble. Basically, if you have models with similar accuracy but different errors, there is a good chance that joining their predictions will yield better performance.

Models in partitions of the data based on independent variables

The first attempt at ensembling was to create individual models for certain partitions of the dataset: for example, creating 24 datasets, each containing only the samples for a given hour, or creating one dataset per application ID, and training individual models on them. Then the test set was subdivided in the same way to get predictions from these individual models.

In some cases, if the partitions correspond to groups that genuinely differ from each other, performance can improve greatly.

After creating several models based on subgroups that seemed to differ from each other, the final ensemble, taking the simple average of 8 models, reached a log loss of 0.3935.
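The simple average itself is trivial, but worth being explicit about (the predictions below are made up for illustration):

```python
def average_ensemble(prediction_lists):
    """Element-wise mean of per-model probability lists."""
    return [sum(ps) / len(ps) for ps in zip(*prediction_lists)]

model_a = [0.10, 0.80, 0.40]
model_b = [0.20, 0.60, 0.50]
print(average_ensemble([model_a, model_b]))  # ~[0.15, 0.70, 0.45]
```

Averaging probabilities directly works well under log loss because it pulls overconfident individual predictions toward safer middle ground.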

Models based on random interactions of the variables

Another way to create ensembles is to use different variables for each model. After seeing that some interactions improved the score with FTRL Proximal, I decided to create a script that tested the effect an interaction would have on the ensemble score. Although interesting, one should be careful about overfitting with this approach.

Each run selected an interaction and tested whether an ensemble of that model with models using other interactions improved performance. This attempt generated 5 linear models with different combinations of variables which, together, taking the simple average, reached a log loss of 0.3888.

Putting together the best models

In the end, I put together several models that had shown a good score and reached a log loss of 0.3878, which secured 42nd position, among the top 3% of solutions.

The difference between the log loss of this solution and the winner was 0.0087.

Other ideas and the winning solution

After the end of the competition it is common for the participants with the best positions to disclose their models and ideas that contributed to achieve a good performance.

In addition to the methods described here, two approaches stood out: features that took temporal factors into account, such as the number of visits a user made to a site or app in the last hour or on the same day; and features based on the count of times a variable, or an interaction between variables, appeared. Knowing how to use the temporal characteristics of the data seems to have favored the winners.