

Classifying Customer Visits to Walmart in 37 Categories Using Machine Learning

Defining a customer’s intent when visiting a store, whether real or virtual, can help a company deliver a personalized experience. That’s why it’s important to use tools that can help categorize and identify these visits.

Walmart provided anonymized data on customer trips to some of its stores. The data included the items purchased and their respective quantities, as well as descriptive information about the products and the day of the week. The task was to use this information to categorize the intent of each customer’s visit.

Although the organizers did not provide a reference table describing each category code, they said that some examples of visit categories are: weekly grocery shopping, gifts for a celebration, or buying clothes for the new season.

Data

There were about 95,000 trips for training and 95,000 for testing. The file basically contained one line for each product purchased during a visit, and it was the competitor’s responsibility to transform it so that all the data for a trip was merged into a single example.

The following information defined the products: the UPC code, with more than 100 thousand different values; the Fineline code, a refined category created by Walmart, with about 5,000 values; and the Department, with about 60 different values.

In addition, the ScanCount variable defined how many products of that type were purchased, and if it was a negative number, the customer was returning products.

Transforming the Variables

In order to train a model it was necessary, as a minimum, to group the basic information by trip. Several reports on the forum talked about transformations resulting in thousands of variables.

As I only had 7 days to work on this competition, my goal was to create a simple and compact model, with as few variables as possible, but with performance good enough to place me in the top 10%.

Basic Variables

Examples of variables based on statistics of each trip are: the average quantity of each product purchased during the trip, an indicator of whether any product was returned, the total quantity of products purchased, and the department with the highest quantity of items purchased.

Counts and Proportions

I decided to aggregate the quantity of products purchased per department, and use both the count for each department and the proportion that department represented within the purchase. This created about 120 variables.

Singular Value Decomposition + TF-IDF

To try to use Fineline and UPC without increasing the number of variables, I decided to aggregate the quantity of products in each of them and apply two transformations.

First, TF-IDF, which replaces the raw quantity with a weight based on the proportion of the item within that trip, discounted by how frequently the item appears across other trips.

Then I applied the SVD, which tries to find the components that have the most variation in the data.

These transformations are usually used with text, applied to the word count in each document, and known as Latent Semantic Analysis in that context.

In addition to reducing the size of the data, this is expected to remove much of the noise and reveal latent structure in the categories to which each trip belongs.

In practice this helped a lot with Fineline, but it did not help much with UPC.
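
As a rough sketch of this pipeline with scikit-learn (it assumes df is the raw trip/product table with the competition’s column names, and the number of components is an arbitrary choice for illustration):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD

# one row per (trip, Fineline) pair with the purchased quantity,
# pivoted into a trip x Fineline count matrix
counts = df.pivot_table(index="VisitNumber", columns="FinelineNumber",
                        values="ScanCount", aggfunc="sum", fill_value=0)

tfidf = TfidfTransformer().fit_transform(counts)      # reweight raw quantities
svd = TruncatedSVD(n_components=30, random_state=0)   # keep a compact representation
fineline_features = svd.fit_transform(tfidf)          # dense matrix: trips x components
```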

Logistic Regression L1 and L2

Another way to reduce size is to train a simpler model and use its predictions as a variable in the main model.

For this, I trained two logistic regressions: one with L2 penalty and another with L1 penalty. This generated about 70 variables, 37 for each regression, with the probability of an example belonging to each class.
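
A minimal sketch of how these out-of-fold probability features could be produced with scikit-learn (X and y are the aggregated trip features and the 37 trip types, assumed to be already built):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# out-of-fold class probabilities, so the main model does not see leaked targets
meta_l2 = cross_val_predict(LogisticRegression(penalty="l2", max_iter=1000),
                            X, y, cv=5, method="predict_proba")
meta_l1 = cross_val_predict(LogisticRegression(penalty="l1", solver="liblinear"),
                            X, y, cv=5, method="predict_proba")

# append the probabilities as extra columns for the main model
X_stacked = np.hstack([X, meta_l1, meta_l2])
```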

Models

Gradient Boosted Trees – XGBoost

Most of my time was focused on building a good model with XGBoost. This model was already good enough to stay in the top 10%.

Neural Networks

To complement XGBoost, and try a better position, I decided to train a neural network on the same variables. It had 2 hidden layers and dropout.

Other competitors reported good results with neural networks, but I did not spend much time tinkering with them. The intention was only to achieve a slight improvement over the result of XGBoost.

Possible Results and Improvements

The final solution was a simple ensemble between a neural network and XGBoost, which was enough to secure a position among the top 7%. The solution took about 8 hours of work.

Using only the Gradient Boosted Trees model, without the ensemble, it was possible to stay in the top 9%. This model had only 200 variables, whereas most of the better models had more than 5,000. Increasing the number of variables would probably have pushed this model above the top 5%, but it would have taken much longer to train.

Other possible improvements would be: enlarging the ensemble, creating a set of variables optimized for the neural network, using the Fineline count variables directly, and better tuning the parameters of the logistic regressions and the transformations.

How We Won the Caterpillar Machine Learning Competition at Kaggle

Caterpillar is a manufacturer of industrial equipment such as tractors and engines. To maintain their operations, they need to buy tubes of various specifications, from several different suppliers, to use in their production line. Each supplier and product has a different pricing model.

The task in this competition was to create a model that would be able to price the tubes using historical supplier data and product characteristics.

I had the pleasure of competing on the team that won this competition, surpassing more than 1300 teams of data scientists from around the world. In this article I want to describe the most important parts about creating the solution that secured this victory.

Since I am most familiar with my own part of the solution, much of this article focuses on it, but I want to make it clear that the winning solution was created from different models developed individually by each member. None of us would have managed to win alone, and everyone was essential for the team to reach first place.

Data

Several files with characteristics of the tubes and the suppliers were available. There are two basic categories of pricing: fixed price and variable price according to quantity.

There were about 30,000 rows for training and 30,000 for testing. Something important to consider is that several rows referred to the same tube, differing only in the minimum quantity that had to be purchased to obtain that price.

Among the available variables were: quantity, tube ID, date of purchase, annual use forecast, diameter and length.

Validation

This is one of the most important parts of any data mining task. It is essential to create a validation environment that has the same or very close characteristics of the production environment. In our case, the test data.

This means that it would not be enough to simply shuffle the rows and distribute them into folds as in “standard” cross-validation. Instead, I decided to create folds based on the tube IDs: rather than distributing rows randomly, each fold contained all the rows of a given tube.
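
In scikit-learn, this grouping can be expressed with GroupKFold; a minimal sketch, assuming tube_ids holds the tube identifier of each row and model/evaluate are placeholders:

```python
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
for train_idx, valid_idx in gkf.split(X, y, groups=tube_ids):
    # all rows of a given tube end up on the same side of the split
    model.fit(X[train_idx], y[train_idx])
    print(evaluate(y[valid_idx], model.predict(X[valid_idx])))
```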

Although the data had a temporal dimension, training and test data were not split between past and future, so taking the date into account to define the validation data did not bring benefits.

This validation scheme proved to be quite robust and close to both the public leaderboard and the evaluation in the test data (private leaderboard).

Feature Engineering

After a robust validation environment, the most important part is finding data characteristics that can be transformed into variables to feed the model.

I will present transformations that, in general, can be used with other data, so they are not specific to this task.

Ordinal Representation of Categorical Variables

The most popular implementations of decision-tree-based models available in Python treat all variables as numeric. In this case there are some alternatives, such as one-hot encoding, so that the model can find unique characteristics of the levels of categorical variables.

But given how decision trees work, encoding the variables ordinally still allows them to isolate each level through successive splits, approximating what a proper encoding would achieve.

In this case we have a categorical column in which each level is represented by a number.
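
A minimal sketch of this ordinal (label) encoding with pandas, using hypothetical column names:

```python
import pandas as pd

# each level of the categorical column becomes an integer code
df["supplier_code"] = df["supplier"].astype("category").cat.codes

# pd.factorize achieves the same thing
df["material_code"], _ = pd.factorize(df["material_id"])
```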

Record Counts

The number of times a record appears in the data can also have predictive value.

A common, and quite useful, transformation is to replace each categorical value with the count of records belonging to that level. In this case it produced a significant improvement in the error.
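
A sketch of this count (frequency) encoding with pandas, again with an illustrative column name:

```python
# replace, or complement, each categorical value with how often it appears
level_counts = df["supplier"].value_counts()
df["supplier_count"] = df["supplier"].map(level_counts)
```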

Minimum and Maximum Quantity Available

Utilizing the fact that about 90% of the tubes had their price varying with the purchased quantity, I realized that the more expensive tubes had a lower maximum purchase quantity than cheaper tubes.

Most of the more expensive tubes reached their lowest price when buying 5 units, while the cheaper tubes could require up to 250 units.

This attribute was very important, and demonstrates the value that exists in trying to understand the data and the task in question.

Information Leaks in IDs

This is the situation where the training data contains information that allows us to discover patterns in the dependent variable, but that would not be available in reality.

In this case, each tube was identified by a number, and when sorting the data using this variable, I realized that there were small tube clusters with similar prices and IDs. This means that the ordering of the tubes had a predictive value.

To use this information, I simply created a variable with the numeric part of the IDs.

Both in competitions and in practice, it is important to look for these patterns. In the first case, this can help you achieve a better result. In the second case, all information leakage must be removed so as not to compromise the performance of the model in a production environment.

Weight of Components

Each tube had different components that would be attached to it. The characteristics of each component varied, but one of them stood out, the weight.

With access to the quantity and type of component used in each tube, I calculated the sum of the weight of the components of each tube, and this was a variable with a significant correlation with the price, and small correlation with other features.

Machine Learning Models

Gradient Boosted Trees

Famous for its good performance in competitions across different areas, this was my best model. I used the XGBoost implementation, which is one of the best open-source versions of this model that I know.

XGBoost has the option to optimize the RMSE, but as we were optimizing the RMSLE in this competition, i.e. the logarithmic error, I transformed the dependent variable with the log(y + 1) function. There were suggestions to use a root of the variable instead of the log, and my best model ended up being trained on the variable transformed with the 15th root.

This model, alone, was enough to get the 10th place.
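
A minimal sketch of the target transformation described above, using XGBoost’s scikit-learn wrapper (the hyperparameters are placeholders, not the ones tuned for the competition):

```python
import numpy as np
import xgboost as xgb

# train on log(y + 1), then invert the transform on the predictions
model = xgb.XGBRegressor(n_estimators=1000, learning_rate=0.05)
model.fit(X_train, np.log1p(y_train))
pred = np.expm1(model.predict(X_test))

# the 15th-root variant: fit on y_train ** (1 / 15.0)
# and raise the predictions back to the 15th power
```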

Regularized Greedy Forests

This was the novelty of this competition. Although I tested this model in the Facebook competition, I had never dedicated time to make it work.

The algorithm basically does the following: it optimizes the same loss function as Gradient Boosted Trees, but with an explicit regularization term. In addition, it periodically readjusts the weights of each terminal node of the trees in the ensemble.

In theory, it’s an improvement over Gradient Boosted Trees, but in this competition it did not present a better performance. I believe that with more time to adjust the parameters and train it is possible to achieve a better result.

Other Models

The team trained two other types of models to compose the ensemble, but they alone did not perform well:

– Neural networks
– Factorization Machines

Ensemble (Stacking)

The ensemble was done in three levels, using models of all the members of the team.

At the first level, the models received the original data and made out-of-fold predictions using the same cross-validation scheme described above.

At the second level, four models used predictions created at the previous level as training data, and again made predictions using a cross-validation scheme to avoid overfitting.

At the third level, the mean of the predictions of the second level models was used as the final forecast.

Conclusion

In this competition, it was extremely unlikely to win without forming a good team. The use of solutions created by different members, which eventually improved the predictions in the ensemble, was essential.

In addition, one of the most important factors for winning a Kaggle competition is time. The more time you have to test ideas, the greater the chances of success.

Determining Product Relevance in a Search Engine With Machine Learning

One of the most important tasks for an e-commerce site is to ensure that users find the products they are looking for, which requires a search engine that returns relevant products. In this competition, CrowdFlower, a platform specialized in data collection, made available product search data along with the median relevance score given by three people for each pair of search term and product.

The goal was to create a model that could calculate the relevance of a product to a given search phrase.

Data

Approximately 10,000 training records were available, and 20,000 were separated for testing. In these records there was the search term, title and description of the product, the median of relevance and the variance of the relevance attributed by the users.

In the test samples there was a portion of “fake” records, which were not used to calculate performance but were included to discourage manual labeling.

Evaluation Metric

The evaluation metric was the Quadratic Weighted Kappa. It measures the agreement between two “evaluators”; in this case, between the score given by the model and the score given by humans. It is sensitive not only to accuracy but also to the distribution of the scores (labels). A random solution would result in a Kappa of 0, and perfect agreement in a Kappa of 1, but there are cases where Kappa becomes negative.
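
For reference, scikit-learn computes this metric directly; a small sketch assuming y_true and y_pred are the human and model relevance scores (1 to 4):

```python
from sklearn.metrics import cohen_kappa_score

kappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")
```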

Data Processing

Text

The basic transformation of title and description is to make a bag of words. In its simplest form it is a matrix whose rows are the examples (documents) and the columns are words (terms). The value of each cell is the number of times the word appears in that document.

A more advanced transformation is to take this matrix and do some operations using the frequency of terms in each document, and also in the document body. This is called TF-IDF. In this case, the most frequent terms within a specific document receive a greater weight, but that is discounted by a factor inversely proportional to its frequency in other documents.

Put simply: Rare words that appear a lot in a document are given more weight, and frequent words that appear in all documents become less important.

Beyond that, it is important to set a minimum number of times a word must appear in the corpus before it becomes a column, in order to reduce noise.

The best representation I found was to train one matrix based only on titles, and another based only on descriptions. Some participants suggested training on both at the same time, but since this is a search problem, I believe it is important to differentiate them.
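
A sketch of this split representation with scikit-learn (the column names are illustrative, and min_df is an arbitrary noise threshold in the spirit of the paragraph above):

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

title_vec = TfidfVectorizer(min_df=3)   # min_df drops very rare terms
desc_vec = TfidfVectorizer(min_df=3)

X_title = title_vec.fit_transform(train["product_title"])
X_desc = desc_vec.fit_transform(train["product_description"])
X_text = hstack([X_title, X_desc])      # titles and descriptions kept separate
```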

SVD – Latent Semantic Analysis

As you might expect, the matrix of documents is huge, and it’s pretty sparse. This makes it difficult to use with some non-linear models. In this case there is the option of reducing dimensionality using SVD. This linear algebra algorithm basically finds the main components of the document matrix, where there is greater variation. This reduces the dimensionality of the problem, but can cause loss of information.

In Natural Language Processing these components end up working well because they filter out some of the noise in the data. Applying SVD to the TF-IDF matrix gives a representation known as Latent Semantic Analysis. The interpretation is that each component obtained refers to a “concept” represented by the documents. It is also widely used to visualize documents in a low-dimensional space.

In the competition I used SVD to reduce the dimensionality of the TF-IDF matrix before feeding it to the models.

Other Attributes

In addition to the above features, I created others that seemed to make sense, and some that I found in related academic papers. The most important of these was the percentage of search words that were in the title. In addition, I created attributes based on the average similarity between a product and the others of the same search term, as well as the count of times the search term appears in the data.

An important part was coding each search term with one hot. That is, for each of them I created a column that had the number 1 if it was present, and the number 0 otherwise. I did this to help the model capture the predictive value inside each search term.

Regression or Classification?

There were two ways to look at this task: consider each level of the median as a class, making it a classification problem, or treat it as a regression problem and round the predictions.

At first I thought it was more appropriate to treat it as a regression, but since Kaggle wanted the rounded predictions, I ended up creating several classification models and did not have time to test the other alternative.

With the classification we have two options: use the class provided by the model, or use the probabilities of each class to make a weighted average to get the score. The first alternative proved to be better in this case.

Models

Support Vector Machine

The best models were obtained using SVM. It is one of the most used models for text classification, so it is no surprise that it performs well.

The best individual model, using the SVD components and the other attributes, reached a Kappa of 0.672 in the cross-validation, and 0.654 in the LB.

Random Forest

One of my favorite models, but not usually the best choice for sparse data in high dimensions. Anyway, after several tests, I got 0.6695 in cross-validation, and 0.6508 in LB using a TF-IDF matrix with the other attributes.

One reason that may help explain why it behaves better with the TF-IDF matrix than with the SVD matrix is that Random Forest is a variance-reducing method, so it remains robust even with many attributes, as long as they are relevant to the prediction.

Gradient Boosted Decision Trees

Although this model is also not very common for sparse, high-dimensional data, I decided to train several “weak” models and stack them, using their predictions together with the other attributes that did not involve a representation of the words.

The stacked models had an average of 0.60 Kappa in validation.

The score for this model was around 0.624 on LB. But the real intention with it was to have a different model to ensemble.

Post Processing

In all of the Top 10 solutions I read, there was one detail in common: post-processing predictions. Teams used special methods for rounding, multiplying by coefficients, or determining different cut-off points for predicting each “class”.

In one case, the team ordered the predictions and, according to the proportion of each class in the training data, rounded the forecasts. Kappa is a metric sensitive to the distribution of predictions, so this ended up helping.
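
A rough sketch of how that kind of proportional rounding might look (this is my reconstruction for illustration, not the team’s actual code; it assumes the continuous scores increase with relevance):

```python
import numpy as np

def round_by_proportion(scores, train_labels):
    # assign labels to the sorted scores so the predicted label
    # distribution matches the training distribution
    labels, counts = np.unique(train_labels, return_counts=True)
    cut_points = np.cumsum(counts / counts.sum())[:-1]
    bounds = (cut_points * len(scores)).astype(int)
    order = np.argsort(scores)
    out = np.empty(len(scores), dtype=labels.dtype)
    prev = 0
    for label, b in zip(labels, list(bounds) + [len(scores)]):
        out[order[prev:b]] = label
        prev = b
    return out
```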

The closest I got to an adjustment of the predictions was to use the geometric mean instead of the simple average for the ensemble. The models tended to overestimate the relevance of the products, and the geometric mean helped bias the predictions toward the lower classes.

Ensemble and Overfitting

In the end, I took a geometric mean of my best models. The ensemble consisted of several SVMs, a Random Forest and a Gradient Boosted Decision Trees model.

Although it is not good practice, especially in cases with a small amount of data like this competition, I used the LB to validate the ensemble. This led me to choose my second-best ensemble, yet I still finished in the top 5% of the competition.

How I Got 6th Place Among 985 Participants in the Facebook Machine Learning Competition

In this recruiting competition, Facebook provided anonymized data about users of an auction site. The task was to create a model that determined which users were human and which were robots. Human users of an auction site do not want to have to compete with machines, and a site with many robots can experience an exodus of human users, so it is important to have a robust system for detecting suspicious users.

According to the administrators, this data did not come from Facebook, but I think the interest was to create an environment similar to their advertising auction system.

This competition became quite special for me, since it was my first Top 10 on Kaggle and made me a Kaggle Master. That was my first goal, but it also came with the bonus of being considered for an interview for a Data Scientist position at Facebook.

Data

One of the most interesting parts of this competition, which made it very close to a real case of data science, was the fact that we did not have a raw file with variables related to the objective that could be used directly with some model of machine learning.

Two files were available: the first contained about 2000 rows with anonymous information from users, such as ID and Address. The second contained more than 7 million bids made on the site.

Variables in the second file included: Bid ID, User ID, Auction ID, auction category, user device, time, country, IP, and URL. The time was coded, so it was not possible to find out the exact date of the bids.

Validation

Having a robust validation environment is essential for any machine learning task. In this case there was the challenge of having only 2000 examples to train and validate. After testing several combinations, I decided to use a stratified cross-validation, with 5 folds, and repeat this validation 5 times.

The repetition causes the examples to be randomly placed in different folds each time, which helps stabilize the error estimate, since any particular example has less influence on the mean.
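
A minimal sketch of this scheme with scikit-learn (model is a placeholder estimator):

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
print(scores.mean(), scores.std())
```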

Feature Engineering

The most important part of any data science job is to ensure that you have the correct data to do the prediction. A very common mistake is to think that you just put the data in the algorithm, and it will learn what you want. This was by far the most important part of my solution.

Counts

The first step was to create variables with the count of how many bids each user made. Extending this idea, I created variables for counting how many different devices, IPs, categories, countries, and URLs the same user had used. With these simple variables, a Random Forest, without tuning, reached 0.86 AUC in cross-validation with 5 divisions.

In the case of countries, in addition to the variable counting the number of different countries, I created individual variables for each country, counting how many bids the user made from it. This helps the model identify “risky countries,” where it may be easier to host robots that place bids.

Simple Statistics

After seeing that the counts had great predictive value, I decided to test simple statistics, such as averages and deviations. Examples of important variables:

– Average and standard deviation of bids per auction;
– Maximum number of bids given in the same auction, with the same timestamp;
– Average time between bids.
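
A sketch of how per-user counts and statistics like these could be derived with pandas (column names follow the competition’s bids file, but treat them as assumptions):

```python
# bids: one row per bid, with bidder_id, bid_id, auction, device, country, ip, url, time
per_user = bids.groupby("bidder_id").agg(
    n_bids=("bid_id", "count"),
    n_devices=("device", "nunique"),
    n_ips=("ip", "nunique"),
    n_countries=("country", "nunique"),
    n_urls=("url", "nunique"),
    n_auctions=("auction", "nunique"),
)

# average and standard deviation of bids per auction for each user
bids_per_auction = bids.groupby(["bidder_id", "auction"]).size()
per_user["mean_bids_per_auction"] = bids_per_auction.groupby("bidder_id").mean()
per_user["std_bids_per_auction"] = bids_per_auction.groupby("bidder_id").std()
```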

Time

The variable that indicated the time in the original data was coded so that you could not know the exact date that the bids occurred. But after analyzing the data, I realized that even though I did not know the exact date, it was possible to identify what would be days, hours and minutes.

This allowed me to create variables related to averages, deviations, proportions and counts based on units of time. Some important variables:

– Standard deviation of bids per day;
– Average bids per day;
– Average bids on the same timestamp;
– Time-of-day bid count;
– Average number of auctions the user participated in on the same day.

It seems that all the participants who finished at the top of the LB were able to identify these time patterns.

Proportions

Some proportions based on counts and time were also important. Some examples are:

– Maximum proportion of total bids in a day assigned to the user;
– Density of bids per hour of day

Models

When you have variables that contain enough information about the event, the selection and adjustment part of the model’s parameters becomes simpler. In this case, despite briefly testing linear models, I ended up choosing to better explore models based on decision tree ensembles.

Random Forest

The first model I tested, and the one that proved itself most useful in this case.

I believe its superior performance came from it being a model geared toward variance reduction. Because we had little data to train on, it was important that the model could cope with extreme values and stabilize its predictions.

A Random Forest with 500 trees and parameters adjusted by cross-validation had 0.9112 AUC in the local measurement, and 0.9203 in the LB.

Gradient Boosted Trees

This is one of the most used models in data science competitions because it offers a good performance in several types of tasks. In this case it was not so. My best models were around 0.90-0.91 AUC on LB despite reaching 0.94 on validation.

This raises a question: we have a decision-tree-based model that gives us good performance (Random Forest) and, in theory, boosting should improve the score, so why was that not the case here?

My answer is: Random Forest trades a bit of bias for lower variance, that is, it makes the model more rigid, while GBT reduces bias at the cost of higher variance, making it more sensitive to the noise present in the data. In this case we have fewer than 2000 examples and a strong imbalance between classes. These two factors combined produce data with a good amount of variance, and increasing the variance further causes overfitting.

The solution would be tuning and regularizing the GBT, but with little data to validate it is difficult to trust the results. So I decided to rely on theoretical reasons and continue working with Random Forest to have a more stable model.

Ensembles

The final touch of most of the best competition solutions is to make an ensemble with the best models.

Random Ensemble

Due to the small number of examples, it was difficult to rely on the parameters found through validation. So I decided to create an ensemble with Random Forests randomizing the parameters. Each had a seed and a few different parameters. This ensemble achieved an AUC of 0.92 in the out-of-bag estimate and in LB.

AdaBoost Random Forest

Despite the instability of the GBT, I decided to apply boosting to Random Forest. Perhaps applying this method to more stable models than decision trees could help, even with the small number of examples. And that’s exactly what happened.

In the validation it obtained 0.925 of AUC, and 0.928 in LB.

There was still a difference between the validation score and the LB when I added new variables.

Stabilizing the Seeds

To stabilize the predictions I decided to run the Boosted Random Forest model using several different seeds and average them. This did not improve the result on the LB, but it was stable in cross-validation, which made me trust the model.

Conclusion

In the end, I used a Boosted Random Forest model with a set of attributes that presented an AUC of 0.94 in cross-validation, which was essentially my score on the final test set of the competition and gave me 6th place.

Machine Learning for Sales Forecasting Using Weather Data

Walmart is a company with thousands of stores in 27 countries. It is possible to find several articles on the technological mechanisms used to manage the logistics and distribution of its products. This was the second time they held a competition on Kaggle with the intention of finding interview candidates for data scientist jobs.

A major advantage of this type of competition is that we have access to data from large companies, and understand what problems they are trying to solve with probabilistic models.

The objective of the competition was to create a model that could predict the sales volume of certain products in specific stores in the days before and after blizzards and storms. The example given in the task description was the sale of umbrellas, which intuitively should see an increase before a big storm.

Data

Two files were made available to train the model: one of them contained information about the identification of the stores, products, and the nearest weather stations. The other contained weather data for each station.

In total, data were available on 111 products whose sales could be affected by climatic conditions, 45 stores, and 20 weather stations. The goal was to predict the amount of each product in each store that would be sold 3 days before, 3 days later, and on the day of the weather event.

A weather event meant that more than an inch of rain, or more than two inches of snow, was recorded that day.

The combination of stores and products gave us about 4.6 million examples for training, and about 500,000 for testing. Each example referred to a day, store and product.

Evaluation Metric

The metric used was the Root Mean Squared Logarithmic Error (RMSLE). It is basically the RMSE applied to the log(x + 1) transformation of the predictions and actual values. This means that errors in predictions that should be close to 0 are punished more severely than errors in predictions with higher values. For example, predicting 5 items when it should be 0 carries a greater penalty than predicting 105 when it should be 100.
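
A small sketch of the metric, reproducing the example above:

```python
import numpy as np

def rmsle(y_pred, y_true):
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

rmsle(np.array([5.0]), np.array([0.0]))      # ~1.79: heavily penalized
rmsle(np.array([105.0]), np.array([100.0]))  # ~0.05: barely penalized
```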

Transforming the Data

Since I only had 10 days to work in this competition, I decided to check how far it was possible to come up with a model based only on the original variables and, who knows, make an ensemble.

One difference from other competitions is that, even though the data was organized, you were responsible for joining the weather variables to the product and store data. This makes perfect sense, because Walmart would not want a data scientist who does not know how to manipulate data.

For this I used Pandas. This Python library for data manipulation is one of the most widely used, and is strongly reminiscent of the data structures available in R.

At first I used all the variables as numerical, trained an XGBoost with slight tuning, excluding the variables that coded special alerts, and used 10% of the dataset to determine the number of trees. As expected, the result was poor, about 0.1643 on LB.

Binarizing variables

After testing the first model, I encoded the categorical variables with one-hot encoding. That is, for each level of a variable, a column was created containing 1 if that level was present in the example and 0 otherwise. Normally the number of columns should be the number of levels minus one, so that you do not have problems with collinearity. Since I planned to use models that are not sensitive to this problem, I did not bother deleting one of the columns.
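
A minimal sketch of this encoding with pandas (the column names are placeholders for the categorical columns of the merged dataset):

```python
import pandas as pd

# each level of each listed column becomes its own 0/1 indicator column;
# no level is dropped, matching the choice described above
df = pd.get_dummies(df, columns=["store_nbr", "item_nbr", "station_nbr"])
```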

After tuning using a subset with 10% of the data, and getting the RMSLE of 0.1216 in the validation, I sent the solution, and got the value of 0.1204 in the LB, a good improvement over the previous one.

Imputation

A lot of weather data were missing, so I decided to test a simple imputation method: replace the NaNs by the mean of the column values. After tuning again, now with parameters for these new values, I obtained the RMSLE of 0.1140 in the 10% validation and 0.1095 in the LB.

Time Variables

I did not explore the temporal dimension of the data very much; in one of my attempts I added the previous day’s weather information, which reduced the error to 0.1083.

Data Subsets

One method that worked very well in my first competition, and that I always try, is to divide the training set into small subsets related to some variable and train a specific model for each one. In this case, I decided to create 19 models, one for each of the 19 weather stations present in the test set. The RMSLE on the LB was 0.1101, worse than a single model that treats the stations as variables across the entire dataset.

A serious problem with this approach is using the same model, with the same parameters, for different datasets. Knowing this, I did a small tuning of the parameters for each dataset, which reduced the LB RMSLE to 0.1069.

Despite the small difference, it seems the division into individual models for each station captured some information that was not present in the model with all considered together.

Models

Of the models I tested, two stood out: Gradient Boosted Trees (XGBoost) and Random Forests.

Random Forest

I had used Random Forest for regression only once, in a job, but never with a large amount of data. After tuning the parameters, applying the model to the imputed data, it resulted in a RMSLE of 0.1166 on the LB.

XGBoost

XGBoost presented the best error of an individual model. In addition to adjusting parameters, such as the depth of the trees, using a subset of the data, it was necessary to adjust the number of trees and the learning rate. Usually a small learning rate and a lot of trees is the safe recipe to improve performance, in exchange for more time to train the model.

Since XGBoost is sensitive to the seed of the RNG, I decided to make an ensemble of XGBs by just changing this value. This method had marginally improved my score in other competitions, but here its impact was greater, for the following reason: XGBoost can hold out part of the data and use it to determine the number of trees that minimizes the error. I decided to use this feature, holding out 5% of the data. In addition to varying the seed of the XGB itself, I varied the seed used to split the data, which made the models more diverse, something essential for a good ensemble.
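
A sketch of this seed-varied scheme with the xgboost API (the objective and hyperparameters here are placeholders, not the ones tuned for the competition):

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

preds = []
for seed in range(10):
    # vary the seed of the 5% holdout split as well as XGBoost's own seed
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.05, random_state=seed)
    booster = xgb.train(
        {"objective": "reg:squarederror", "eta": 0.05, "seed": seed},
        xgb.DMatrix(X_tr, label=y_tr),
        num_boost_round=2000,
        evals=[(xgb.DMatrix(X_val, label=y_val), "val")],
        early_stopping_rounds=50,
        verbose_eval=False,
    )
    preds.append(booster.predict(xgb.DMatrix(X_test)))

final_pred = np.mean(preds, axis=0)   # average of the seed-varied models
```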

The most I got to train was 10 XGBoosts in this scheme. Although each model was already quite stable, the ensemble’s RMSLE was 0.1041, a reduction compared to the 0.1095 of the single model.

Final Solution

In the end, I combined all the solutions I had submitted, ending up with an RMSLE of 0.1028 and guaranteeing a position among the top 20%.

Possible Improvements

One day after the end of the competition I reviewed the variables of my XGBoost and found that the variables identifying the products (item_nbr) were not in binary format, yet were considered by the model to be the most important. With the correct encoding I believe it would have been possible to reduce the error further and achieve a better final position, although trees are quite good at capturing patterns even with categorical variables in this format.

Classifying 150 Thousand Products in Different Categories Using Machine Learning

The Otto Group is one of the largest e-commerce companies in the world.
According to them, due to the diversity of the company’s global infrastructure, many similar products are classified into incorrect categories. So they made available data on approximately 200 thousand products, belonging to nine categories. The objective was to create a probabilistic model that correctly classified the products within each category.

Data

For training, about 60 thousand products were available and, for testing, about 150 thousand.

The 93 features were not identified, the only information given about them is that they were numerical variables. This makes it difficult to understand the best way to work with the data, and also the work of creating new features.

In addition, the task involves multiple classes, that is, we have to generate probabilities for 9 classes.

Metric

The metric chosen for this competition was the multi-class log loss.

This metric punishes very confident probabilities in wrong classes, and is sensitive to imbalance between classes. Some of them represented 20% of the training data, while others, 9%. In this case, trying to balance classes, whether with subsampling or penalizing smaller classes, will not help.
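
For reference, scikit-learn implements this metric; a minimal sketch assuming probas is an (n_samples, 9) matrix of predicted class probabilities:

```python
from sklearn.metrics import log_loss

# average of -log(probability assigned to the true class)
score = log_loss(y_true, probas)
```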

Feature Engineering

Even with anonymous variables, I decided to explore the possibility of creating interactions between them, which might not be captured by the models.

Basic

I tried to create new features based on sums, differences, ratios and products between the originals. The only feature that contributed significantly was the sum of all attributes for a given product. Most of the attributes were pretty sparse (had more zeros than other values), so maybe that contributed to the lack of success.

Using a GBM / RF to Select Features

As we are talking about a gigantic space of possibilities of combinations of attributes, I decided to use a technique used by a participant of another competition and published in this link: http://trevorstephens.com/post/98233123324/armchair-particle-physicist

Basically it consists of creating some datasets with the desired interactions and training a model of Gradient Boosted Trees in each one. After that, we check which are the most important features in each dataset. The logic is: if a feature is important in several different datasets, it should be important overall.

I tried it, using 2 stratified divisions, but although they agreed on the importance of some interactions, they did not improve my base model. As I had already spent days dedicating myself to this part, I decided that I would focus more on fitting models for an ensemble.

If I had continued, perhaps selecting the X best interactions and re-adjusting the hyperparameters of some model could extract value from these variables. Or cause terrible overfitting.

Models

Gradient Boosted Trees (XGBoost)

My individual model with the lowest log loss was created with XGBoost, which is a fast and parallel implementation of the powerful Gradient Boosted Trees. This model has already been in winning solutions for a lot of competitions and usually has superior performance.

To find the best hyperparameter combination I used a random search. I fixed the learning rate at 0.05 and varied parameters such as tree depth, the minimum number of examples that should compose a node, and the proportion of examples and variables the algorithm should randomly sample to create each tree.

Usually the lower the learning rate, the better the accuracy, but more trees are needed, which increases training time. In the end, I left this value at 0.01 and trained 2000 trees.

These parameters control overfitting. After finding the best combination, the log loss with 3-fold cross-validation was 0.4656, and on the leaderboard, 0.4368.

Random Forests

I did not have initial success with Random Forests in this competition. One suggestion given in the forum was to calibrate the predictions. Recently the scikit-learn developers made available a new tool that allows us to adjust the output values of a model so that they become closer to the real probabilities.

A Random Forest usually makes a vote among its components, and the proportion of each class is given as the probability of the example belonging to that class. These proportions do not necessarily match the actual probabilities of the events, so we will only have true probabilities if we calibrate.

As this competition asked us to predict the likelihood of a product belonging to a category, it was very important to have the probabilities adjusted correctly. In this case, I used the new scikit-learn tool with 5 data splits.
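
The tool in question is CalibratedClassifierCV; a minimal sketch of how it can wrap a Random Forest (the calibration method and forest size here are assumptions):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
calibrated = CalibratedClassifierCV(rf, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
probas = calibrated.predict_proba(X_test)   # calibrated class probabilities
```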

This greatly improved Random Forest’s predictions, and although they also use decision trees, making them similar to the GBM, they ultimately contributed significantly to an ensemble.

Neural Networks

I had the opportunity to learn two great modules to train neural networks in a simple way in Python. They are Lasagne and NoLearn. With them, it is simple to create and train state-of-the-art neural networks, including using GPUs for processing.

But make no mistake, although they facilitate implementation, a neural network requires a lot of decisions to become useful. We have to determine the architecture, the methods of initialization of weights and regularization. Some researchers suggest doing a random search for initial parameters and continue to adjust manually.

In this case, the architecture that served me well was the following: 3 hidden layers with 768, 512 and 256 units, respectively. I trained two versions, one with dropout of 0.5 between the hidden layers, and another with dropout of 0.5 only after the first hidden layer. The hidden units were ReLU, and the output unit, softmax.
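
The original networks were built with Lasagne and NoLearn; purely to illustrate the architecture, here is a rough equivalent in Keras (the optimizer and other training details are my assumptions, not the settings used at the time):

```python
from tensorflow.keras import layers, models

def build_net(n_features, n_classes=9):
    # 768-512-256 hidden layers with ReLU and dropout, softmax output over 9 classes
    model = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(768, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(256, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```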

An interesting interpretation of another competitor who came to a similar architecture is that the first hidden layer allows the network to make random projections of the data, and the smaller layers that come after it, seek a synthesis of the information. Neural networks are naturally difficult to interpret, but I found this a very interesting point of view.

Finally, since the solution of a neural network is a local minimum, and dependent on the initial weights (and we have many initial weights), I decided to average 5 neural networks with different initial weights. This average gave me a score of 0.4390 on LB, comparable to XGBoost.

Other

I still tried to train: SVM, Nearest Neighbors, Logistic Regression, but none of them performed well or contributed significantly to the ensemble.

Adjusting the Hyperparameters

Since we did not have access to the contents of the variables, it was vital to adjust the parameters to get the best models. At first I did a random search with reasonable values that would still allow the models to explore the parameter space a bit.

Usually the hyperparameters of a model are not independent, so adjusting one at a time will not get you the best combination. It is therefore important to scan combinations of parameters. Unfortunately it is usually impossible to test all possible combinations, but random search lets us get an idea of the solution space without having to explore it exhaustively.

Once this exploration is done, it is good to manually vary some of them, and see if it is possible to improve performance on some neighboring combination.

Result

In the end my solution was a weighted average of the following models:

– Gradient Boosted Trees (XGBoost) on the original variables and the sum of the columns.
– Random Forest on the original variables, log(X + 1) of them, and the sum of the columns.
– Neural network with 3 layers and dropout of 0.5 in each, on the original variables, log(X + 1) of them, and the sum of the columns.
– Neural network with 3 layers with dropout of 0.5 in the first hidden layer and none in the second, on the original variables, log(X + 1) of them, and the sum of the columns.

In addition to these models, I added a “bias”: simply predicting the proportion of each class in the training data as the probability of an example belonging to it.

This solution hit the log loss of 0.4080, and guaranteed me the 35th position among 3514 teams (Top 1%).

Winning solutions

The winning team’s solution consisted of a three-layered ensemble:

In the first layer 33 different models were trained, varying both the type of model and the variables used to train each one. Some were multiple models trained with bagging.

In the second layer, the predictions of these models were used to feed an XGBoost, a neural network, and an ExtraTrees with Adaboost.

In the third, and last, layer, they bagged the three models of the second layer (totaling 1100 models) and then did a weighted average among them.

In addition, new variables were created, based on the distance of the example to the closest examples of each class, as well as others based on the TF-IDF and T-SNE transformations of the original dataset.

Using Machine Learning to Identify Drivers From GPS Data

In recent years the insurance industry has been looking for ways to improve its models using Machine Learning. One is to use data that goes beyond a form completed by the insured to determine the risk of accidents.

One of the methods is to use driver behavior data obtained through GPS tracking. It is believed that this makes it possible to capture information and profile patterns that go beyond traditional methods.

What makes a driver different from the other? Identifying the driver who is driving a vehicle during a trip is the first step in getting a good model.

Competition Description

The insurance company AXA provided anonymized trips of approximately 2700 drivers. For each driver there were 200 trips, some of which were not from the driver in question, and the task was to create a model that would identify which trips were incorrectly attributed to that driver.

To avoid information leakage, the data was centered at the origin and randomly rotated, and some portions of the trips were removed from the beginning and the end.

Data

In total, 200 trips from each of 2736 drivers were made available. Within each driver’s folder there was a random number of trips falsely attributed to that driver. An important piece of information is that we had the assurance that most trips truly belonged to the driver.

Each trip was described in a CSV file with numerical values for the x and y positions, measured as the distance from the origin in meters, and each line corresponded to one second of displacement.

Metric

The metric used for evaluation was the area under the ROC curve, abbreviated as AUC.

Features

The features I used are based primarily on the first three derivatives of the displacement: speed, acceleration, and jerk. Other derivatives undermined performance.

Frequencies – FFT

I tried to extract frequencies from both the raw data and the derivatives. I did this with the FFT (Fast Fourier Transform) function available in NumPy. I divided the frequency range into 20 buckets, and the value of each feature was the sum of the frequencies in its respective bucket.

In this case I extracted the isolated frequencies for each component (x and y). Maybe getting the frequencies using speed and acceleration with the two components (x and y) together, would have been better, but I did not test this hypothesis.

The maximum AUC I achieved, using frequencies of the three derivatives, was 0.7058, with a logistic regression without parameter optimization.
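
A sketch of these transformations with NumPy, assuming trip is an (n_seconds, 2) array of x, y positions sampled at 1 Hz:

```python
import numpy as np

def fft_bucket_features(signal, n_buckets=20):
    # frequency magnitudes of a 1-D signal, summed into n_buckets features
    spectrum = np.abs(np.fft.rfft(signal))
    buckets = np.array_split(spectrum, n_buckets)
    return np.array([b.sum() for b in buckets])

velocity = np.diff(trip, axis=0)           # first derivative
accel = np.diff(velocity, axis=0)          # second derivative
jerk = np.diff(accel, axis=0)              # third derivative

speed = np.linalg.norm(velocity, axis=1)   # magnitude of each derivative
accel_mag = np.linalg.norm(accel, axis=1)
jerk_mag = np.linalg.norm(jerk, axis=1)

features = np.concatenate(
    [fft_bucket_features(s) for s in (speed, accel_mag, jerk_mag)]
)
```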

Histograms

This is the transformation I found in all the academic papers I read on the subject. It consists of making a histogram of the attributes. In this case, we gain a new parameter: the size of the intervals.

The best configuration I found was to divide the magnitude of the velocity into 50 intervals, and the acceleration into 10.

Two details improved performance: using the “density” argument of the NumPy function, which computes the probability density function, and normalizing the results so that they sum to 1.
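
Continuing the previous sketch, the histogram features might look like this (the bin counts follow the text; whether to fix a common bin range across trips is left open):

```python
speed_hist, _ = np.histogram(speed, bins=50, density=True)
accel_hist, _ = np.histogram(accel_mag, bins=10, density=True)

# normalize so each histogram sums to 1
speed_hist = speed_hist / speed_hist.sum()
accel_hist = accel_hist / accel_hist.sum()

hist_features = np.concatenate([speed_hist, accel_hist])
```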

Percentiles

Another frequent suggestion was to use the value of attributes at certain percentiles. This helps you figure out how extreme the values are.

An example may facilitate understanding: in the case of velocity, in order to obtain the value in the 5th percentile, I ordered the values incrementally and located the list point that represented the border of the first 5% of the ordered data.

If we have a list with 100 ordered velocity values with an index ranging from 1 to 100, we would get the value at position 5. The value at position 50, for example, would be the median.

This is important because it differentiates drivers who have a faster or slower profile.

I used the 5th to 95th percentiles, with intervals of 5.

Repeated Trips

One approach that was important in getting the best teams to the top of the leaderboard was creating an algorithm to find similar trips. It was notable that some stretches, although rotated to a different angle with respect to the origin, were the same.

I created a simple algorithm, based on the speed of the driver during the trip but, despite having found some similar trips, it ended up hurting the score.

The algorithms that succeeded were more complex, and in some cases involved variable transformations, dimensionality reduction, and thresholds to determine similarity. Of these, two methods seemed essential:

– Ramer-Douglas-Peucker: because each trip had a different duration, this method was used to reduce the number of points.
– Dynamic Time Warping: this method was used to compare parts of the trips and determine the similarity between them.

Sampling

Since it was not possible to know for sure which trips truly belonged to each driver (some trips marked as positive were in fact negative), there was an opportunity to decide the best way to sample the data.

Random

At first, with unsupervised methods, I used only the 200 trips of each driver. One suggestion in the competition forum was to use the 200 trips of a driver as positive examples and use trips of other drivers as negative examples.

Overall this was a very good approach. Among the ways of doing it were:

– Randomly pick N drivers and treat all of their trips as negative.
– Randomly choose V trips from N different drivers to use as negative.

The approach that worked best for me was the second option: selecting a small number of trips from several different drivers.

Similar / different samples

Another alternative that I tested was to select negative samples of drivers that were more similar or different from the positive samples, according to a measure like Euclidean distance.

I created attributes such as the average speed, distance traveled, and duration of trips for each driver and determined which were the most similar, and which were the most different of the driver in question.

Although the approach using the “most different” drivers showed improvement during cross-validation, it did not improve the score on the test set. The reason I attribute to this is that the validation data contains trips that are falsely marked as belonging to the driver, whereas in the test set we have the absolute ground truth.

Reinforcing predictions

I tried to reinforce the model predictions by training more than once, but now using the predictions from the previous model instead of the original dependent variable, as follows:

In the first pass I did the normal procedure, using the driver’s trips as positive, and others randomly selected from other drivers as negative to obtain forecasts.

In the next passes I used the classes provided by the previous model as dependent variable. This improved greatly in cross-validation, but showed a very small improvement in Kaggle’s validation data.

Models and Methods

Clustering

At first I decided to group the trips according to similar characteristics. This idea was based on the assumption that, since most trips belonged to the driver, it would be possible to separate the trips into two clusters, and the one with fewer trips would be the “negative” one.

Another idea was to increase the number of clusters, since there were differences between the trips even when they belonged to the same driver.

Either way, these ideas proved useless, with AUC close to 0.5.

Another unsupervised approach I tried was to calculate the cosine similarity between the average of the trips’ attributes and the individual records. In this case the “probability” of a trip belonging to the driver would be one minus this value. This also proved useless.

Logistic Regression

My most successful model was logistic regression. This surprised me, since the success described by several participants in the forum involved boosted trees.

I used about 1000 examples, 200 positive and 800 negative. An important detail was to enable the option “class_weight = auto”. This option assigns a different penalty to the examples, according to the class distribution.

In this case, an error or a hit on the positive class was worth more than the same on the negative class. In this way, even though we had many more negative examples, the penalty for errors was balanced.

I trained a regression for each driver, personalizing the regularization penalty (L1 or L2) and its coefficient. Each driver model performed a search for the values for this coefficient using the two types of regularization, and at the end selected the type of regularization and the coefficient that presented the best AUC in cross-validation.
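
A sketch of this per-driver search with scikit-learn (in current versions the class_weight option is called “balanced” rather than “auto”; the C grid is illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# X_driver / y_driver: the driver's 200 trips plus the sampled negatives
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]}
base = LogisticRegression(class_weight="balanced", solver="liblinear")
search = GridSearchCV(base, param_grid, scoring="roc_auc", cv=5)
search.fit(X_driver, y_driver)

# probability of each of the driver's 200 trips really being theirs
trip_scores = search.predict_proba(X_driver[:200])[:, 1]
```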

The best individual model reached the AUC of 0.86.

Random Forests / Boosted Trees

According to the forums, the most used model was Gradient Boosted Trees. Basically it consists of training a sequence of decision trees, assigning a greater weight to examples that were incorrectly classified by previous trees. The probability of a specific example is a linear combination of the predictions of each tree.

In my case the attempts to use this model were not successful. Two reasons may have contributed:

– For lack of confidence in the validation, I ended up not paying much attention to optimizing its parameters to achieve a good AUC.

– The cross-validation process to find good parameters was going to take a long time, and preliminary tests had not yielded good results.

Anyway, I trained a Random Forests model with default parameters to use in an ensemble. It achieved an AUC of 0.847.

Ensemble

Towards the end of the competition I decided to take the “easy” way to improve the score: ensembles.

For a part of the ensemble I trained the following models:

– Logistic regression with L1 regularization only
– Logistic regression with L2 regularization only
– SVM RBF
– Linear SVM
– Random Forests

Although SVM did not offer probabilities, I thought it would be a good contribution to the ensemble, which proved to be true.

The simple mean of the predictions of these models produced an AUC of 0.8825.

Finally, I joined these with everything else I had produced since the start of the competition, for a final AUC of 0.8915. This guaranteed me a final position among the 15% best solutions of the competition.

Methods of Winning Solutions

The winning solutions used 400 examples for each driver, selecting random trips from other drivers as negative examples. The most used model was Gradient Boosted Trees.

In addition, complex algorithms that found similar trips and assigned them directly to the driver squeezed out the extra points needed to rank among the top 10 solutions.

How to Create a Machine Learning Model for Big Data in 10 Minutes Using Python

Support Vector Machines are popular models in machine learning applications. Although they follow simple principles, they have already proven to be very powerful in various tasks and datasets. In this article I want to demonstrate how to implement an SVM capable of handling data that arrives in real time without having to store them in memory.

Click here to access the code (svm_english.py) and dataset for this article. I recommend running it with PyPy.

Main ways to train a machine learning model

There are three popular ways to train a model: batch learning, mini-batch learning, and stochastic learning.

Batch Learning: in the first mode, we store all the training data in an array and feed it to the algorithm, minimizing the loss function over all the examples at once. This is not always possible due to the size of the dataset, and in those cases we have to resort to the other two modes.

Mini-Batch Learning: in this case, we pick a number N of examples and divide the training set into blocks of that size, then train the model on one block at a time.

Stochastic Learning: this is a variation of mini-batch learning with N = 1. We use only one example at a time to train the model. This is the preferred mode in big data solutions.

And we have two main divisions:

Offline: in this case we have all the data stored in a database, but it does not fit in memory, so we need to train on one example (or batch) at a time.

Online: in this case we receive the examples in real time, train on each one and then discard it, without needing to store them.

In addition to not having to load all the data in memory to train the model, in some cases, using algorithms that train on one example at a time may be faster than the format that uses all the data at the same time.

Dataset

This article will use the Skin Segmentation Data Set, which can be downloaded from the UCI repository. It refers to a classification task in which the RGB values of random points are extracted from pictures of people's faces, and the task is to determine, from these values, whether a point corresponds to skin or to another part of the image.

According to the researchers, images of people of various ages, races, and genders were collected. A practical application of this task would be to identify images with content inappropriate for minors on the internet.

There are about 245,000 examples, and it’s a binary classification problem.

Training Algorithm

The algorithm used for training will be PEGASOS. The academic paper with the details can be found here.

In its simplest version, we minimize the primal objective of the SVM using one example at a time. Because this is a subgradient method, it does not guarantee that the error decreases at every iteration. Still, there are convergence guarantees.

The only parameter to choose is the constant that controls the degree of regularization. Usually it is chosen with some validation method. Since the purpose of the article is to demonstrate an SVM implementation that can be easily adapted to data that does not fit in memory, or that is received in real time, I will set it to 0.1.

In an actual application, it may be interesting to adjust this parameter using a smaller sample of the data, offline, with cross-validation.

Evaluation of the result

In a real online environment, one way to evaluate the performance of the model would be to calculate the loss before predicting each new example and average it over a time window. This method is called progressive validation.

In our case, we are simulating an online environment, and to make the evaluation of the model more reliable, I will use a separate test set with 20% of the data.

In the repository of this article you will find the files train.csv and test.csv, already prepared. Each contains approximately 25% positive examples, so in addition to the loss, I want to evaluate the number of correct predictions in each class (true positives and true negatives).

The training set has 196,045 examples, and the test set, 49,012. We will use each example of the training set only once, and then evaluate the performance of the model in the test set.

Since we are not going to use this test set to tune hyperparameters, it is a legitimate estimate of the model's performance on new examples. If you want to use it to validate parameters, you will need another test set.

Then we move on to the implementation of the algorithm to train the SVM. We create some support functions: one to calculate the hinge loss, which is the loss we are minimizing, and another to determine the sign of the SVM output. The initial weights of the SVM can all be 0, since we do not have the symmetry problems found in neural networks.

Then we create a function to generate predictions. It simply takes the dot product between the weight vector and the feature vector.
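The actual code is in the repository linked above; what follows is a minimal sketch of what these support functions can look like, written in plain Python so it runs well under PyPy (the function names are mine, not necessarily the ones used in svm_english.py):

```python
def hinge_loss(y, prediction):
    # Zero when the example is beyond the margin, grows linearly otherwise.
    return max(0.0, 1.0 - y * prediction)

def sign(value):
    # Sign of the SVM output; an output of exactly 0 is treated as negative.
    return 1 if value > 0 else -1

def predict(weights, features):
    # Dot product between the weight vector and the feature vector.
    return sum(w * x for w, x in zip(weights, features))

# Initial weights can all be zero: unlike neural networks, there is no symmetry problem.
weights = [0.0, 0.0, 0.0]   # one weight per feature (R, G, B)
```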

Now we get to the function that will train the model. We update the weights using the learning rate (alpha) and the gradient. This function trains on one example at a time, and the learning rate decays, becoming smaller with each new example seen.

The gradient is different for the case where the current example violates the margin (it is on the wrong side of the hyperplane, or too close to it) and for the examples that are classified correctly.

It would be possible to implement this function in fewer lines, but here I opted for clarity. To see how, check the formula in the original publication of the algorithm.
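A sketch of that training step, following the PEGASOS update and using the `predict` function from the sketch above (`reg` is the regularization constant set to 0.1 earlier, and `t` is the 1-based index of the current example; the naming is mine):

```python
def train(weights, features, y, t, reg=0.1):
    # Decaying learning rate: smaller with each new example seen.
    alpha = 1.0 / (reg * t)
    margin = y * predict(weights, features)
    for i in range(len(weights)):
        # Shrink every weight (subgradient of the regularization term)...
        weights[i] *= (1.0 - alpha * reg)
        # ...and, if the example violates the margin, push the hyperplane towards it.
        if margin < 1.0:
            weights[i] += alpha * y * features[i]
    return weights
```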

Now we create the function to train and evaluate the model on the data. In the case of online learning we would receive the data through a stream, so each example would be seen only once.

For each example of the training set we invoke the function ‘train’, which adjusts the weights according to the learning rate. Being a subgradient method, it does not guarantee a reduction of the loss at every iteration.

We predict an example as positive if its signed distance to the hyperplane is positive, and as negative otherwise. It is extremely rare for the output to be exactly 0, but if that happens the example is classified as negative.

In addition, due to the stochastic nature of the algorithm, we will train it several times with the examples in different orders and average the performance. To make the results reproducible, we set a seed for the RNG.
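Putting it together, a sketch of the training-and-evaluation loop, using the functions from the sketches above. It assumes each row of train.csv and test.csv is `R,G,B,label`, with the label already encoded as +1/-1, which may differ from the files in the repository:

```python
import csv
import random

def load(path):
    data = []
    with open(path) as f:
        for row in csv.reader(f):
            *features, label = map(float, row)
            data.append((features, int(label)))
    return data

def train_and_evaluate(train_data, test_data, reg=0.1, seed=0):
    random.seed(seed)                       # seed the RNG so the shuffling is reproducible
    random.shuffle(train_data)
    weights = [0.0] * len(train_data[0][0])
    for t, (features, y) in enumerate(train_data, start=1):
        weights = train(weights, features, y, t, reg)
    # Evaluate on the held-out test set.
    loss, true_pos, true_neg = 0.0, 0, 0
    for features, y in test_data:
        output = predict(weights, features)
        loss += hinge_loss(y, output)
        if sign(output) == 1 and y == 1:
            true_pos += 1
        elif sign(output) == -1 and y == -1:
            true_neg += 1
    return loss / len(test_data), true_pos, true_neg
```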

Training Set Influence

Because this is an algorithm that considers one example at a time, the order in which the examples are presented changes its behavior, causing it to follow different paths toward the minimum. All numbers reported in this section are evaluated on the “test.csv” samples.

Training on our original training set, without shuffling the data, the performance is as follows:

Now, to get a more realistic estimate of performance in practice, let's run it 100 times, shuffling the order of arrival of the examples; the average performance in this case is:

One detail of this dataset is that the class proportions are imbalanced (75/25). A simple remedy, if we have a lot of data, is to reduce the number of examples of the larger class until the proportions are balanced (50/50). In addition, some experts suggest that ensuring that the next example seen by the algorithm belongs to a different class than the previous one may improve performance. Let's try these two options by adding the following code:
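The original snippet is in the linked repository; a minimal sketch of one way to get that effect, filtering the training stream (the `train_data` list from the previous sketch) so that each accepted example belongs to a different class than the previously accepted one:

```python
last_label = None
filtered = []
for features, y in train_data:
    # Only accept an example whose class differs from the previously accepted one;
    # on a 75/25 stream this roughly alternates classes, giving close to 50/50.
    if y == last_label:
        continue
    filtered.append((features, y))
    last_label = y
train_data = filtered
```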

This makes the algorithm always receive an example of a class different from the previous one, which keeps the class proportions approximately equal. We do not touch the test set; it keeps the original ratio (75/25).

In this case, using only the order of the original training set, the performance becomes:

And averaging over 100 shuffled datasets:

We see a significant improvement in the number of true positives, despite a large increase in the loss. The increase happens because the loss is calculated over all the examples: as we “pull” the hyperplane that separates the classes to accept more errors in the negative class in exchange for better performance on the positive class, we end up increasing the distance of some incorrectly classified points.

An alternative would be to keep all the data and assign a proportionally larger loss to errors on examples of the minority class.

In cases with imbalanced classes, such as this one, it is interesting to use metrics that are not sensitive to the number of examples in each class, hence the choice to report true positives and true negatives.

Results and Characteristics

The SVM primal objective is a convex function, so it is possible to reach the global minimum. However, since we receive examples in random order, train on each one only once, and use a learning rate that decreases over time, we may not reach that point in a practical application. Even so, there are theoretical guarantees of convergence to the global minimum.

In Machine Learning we seek to minimize the generalization error, that is, the error outside the examples used for training. In most cases, the best generalization performance occurs before we reach the minimum on the training set.

Using Machine Learning To Predict Which User Will Click An Ad

Avazu is an advertising platform that delivers ads on websites and mobile applications. One of the biggest challenges in the industry is determining which ad is most relevant to the user. If the ad matches the user's interests, the chance of a click is higher, which increases the profits of both the platform and the advertiser, who gains a potential customer visiting their site.

Machine Learning has been used by several companies that deliver ads in the pay-per-click format, that is, they are paid by the advertiser for each click on the ad. Among the applications is the creation of models to estimate the probability of a user clicking on an ad, based on:

Profile information: location, device used;
Advertiser information: ad category, size, and color;

And also data about the site or app in which the ad is displayed.

In this repository I make available the original code for the implementation of FTRL Proximal by Kaggle’s user tinrtgu, based on the paper developed by Google, as well as the modified code to obtain my best single model.

Data Description

In this competition Avazu made available a dataset with the logs of 10 days of impressions and clicks, with information about users, advertisers, and display platforms. The goal was to predict the probability of a visitor, on an app or site, clicking on the displayed ad.

The day following the training period was held out as the test set to evaluate the model. The training set had about 40 million rows, and the test set about 4 million. Furthermore, it was a high-dimensional problem, with more than 1 million independent variables.

The metric chosen for evaluation was log loss, which heavily punishes confident but inaccurate predictions. We should keep in mind that there was an imbalance between the proportions of positive and negative examples, and this metric favors correctness in the most represented class (in this case, the negatives).

Baseline model

Predicting a probability of 50% for all examples gave a log loss of 0.6932. The first model I tested was a logistic regression with Vowpal Wabbit, without regularization and with only 1 pass over the data. It resulted in a log loss of 0.3993.

Cleaning the Data

Due to the large number of features and values, some of them didn’t occur frequently in the dataset, and this can hurt the classification. Logistic regression is sensitive to outliers in the features, so removing them could improve performance.

I tested two strategies: removing the outliers, or grouping them under a “rare” value. The choice that significantly improved the log loss was to group the feature values that were present in fewer than 5 examples.
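A sketch of the kind of grouping I mean, counting the occurrences of each (column, value) pair and replacing those seen in fewer than 5 rows with a single “rare” token (column names and the token are illustrative):

```python
from collections import Counter

def group_rare_values(rows, min_count=5, rare_token="rare"):
    """rows: list of dicts mapping column name -> categorical value."""
    counts = Counter((col, val) for row in rows for col, val in row.items())
    return [
        {col: (val if counts[(col, val)] >= min_count else rare_token)
         for col, val in row.items()}
        for row in rows
    ]

# The value "xyz" appears only once, so it is replaced by "rare".
rows = [{"site_id": "abc"}] * 6 + [{"site_id": "xyz"}]
print(group_rare_values(rows)[-1])   # {'site_id': 'rare'}
```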

The same logistic regression, without any optimization, applied to the “clean” dataset reached a log loss of 0.3954.

Validation Issues

In this competition one of the big problems was finding a suitable way to create a reliable validation set.

These data change rapidly: new variables come in and old ones stop being used in a matter of days or weeks, so it is natural that it was hard to reach a consensus on how exactly the model would perform in the future.

Some ways were discussed in the competition forum, and two of them were widely used:

Progressive Holdout

In this scheme, for every N examples used to train the model, the log loss is calculated on example N + 1. After going through the entire dataset, the average of these scores is taken. In my attempts this proved to be a rather optimistic method.
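A sketch of progressive holdout for log loss, assuming a model object with generic `learn_one` / `predict_one` methods (this interface is an illustration, not the code I actually used):

```python
import math

def progressive_holdout(stream, model, n=100):
    """Score every (n+1)-th example with the current model; train on all the others."""
    losses, seen = [], 0
    for features, y in stream:                         # y in {0, 1}
        seen += 1
        if seen % (n + 1) == 0:
            p = model.predict_one(features)            # predicted probability of a click
            p = min(max(p, 1e-15), 1.0 - 1e-15)        # clip to avoid log(0)
            losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
        else:
            model.learn_one(features, y)
    return sum(losses) / len(losses)
```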

Validation in the last days

Because this data has a time dependence, the model was trained excluding the last day, or the last two days, of the dataset, which were then used for validation. This alternative proved more reliable than the other, but was not very robust to the choice of interactions between variables, so it was still necessary to be careful about overfitting in those cases.

Feature Engineering

All features of this competition were binary, and some were anonymous, so I had no information about what they represented.

I tried counting how many times each value of a variable appeared in the dataset and using this count as a new feature. It did not work for me and worsened performance, although it worked for other teams.

An alternative that showed an improvement, though nothing significant, was to create a variable indicating whether a given value had more or fewer than 100 occurrences.

Interaction between variables

Another alternative was to create interactions between two or more variables. This was done simply by creating a variable that indicated the presence of a specific combination of values. At first I tried interactions between all pairs of variables, which worsened performance.

I made another attempt, this time manually creating combinations that seemed relevant to me (representing a user's access to an app, for example). These eventually improved the model, reducing the log loss to 0.3889.
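A sketch of how such interaction features can be built, concatenating the values of hand-picked column pairs into a new categorical value (the column names below are illustrative, since the real ones were anonymized):

```python
def add_interactions(row, pairs):
    """row: dict of feature name -> value; pairs: list of (col_a, col_b) to combine."""
    out = dict(row)
    for a, b in pairs:
        out[a + "_x_" + b] = str(row[a]) + "_" + str(row[b])
    return out

# Example: an interaction representing a user's access to a specific app.
row = {"device_id": "d1", "app_id": "a9", "hour": "14"}
print(add_interactions(row, [("device_id", "app_id")]))
# {'device_id': 'd1', 'app_id': 'a9', 'hour': '14', 'device_id_x_app_id': 'd1_a9'}
```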

Models Trained

Logistic Regression with Vowpal Wabbit

I took this opportunity to learn a new tool: Vowpal Wabbit, a fast implementation of machine learning algorithms that minimize convex loss functions for both regression and classification. As a reference, the “pure” logistic regression, without data cleaning, had a log loss of 0.3993. After cleaning, this number dropped to 0.3954.

Calibrated SVM with Vowpal Wabbit

Since it was necessary to send a list of probabilities to Kaggle, I tried using the distance of each example to the SVM hyperplane as the input both for a logistic regression (inspired by Platt's scaling) and for an isotonic regression, available in scikit-learn.

These are two popular ways to calibrate an SVM so that it gives us probabilities. None showed a good improvement in the score.
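A sketch of the two calibration routes with scikit-learn, using `LinearSVC` and synthetic data as stand-ins for the Vowpal Wabbit SVM and the real features; the idea is simply to map the signed distances to the hyperplane into probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

# Tiny synthetic stand-in for the real (hashed) ad features.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train = X[:1000], y[:1000]
X_valid, y_valid = X[1000:], y[1000:]

svm = LinearSVC().fit(X_train, y_train)
scores = svm.decision_function(X_valid)       # signed distances to the hyperplane

# Platt-style calibration: logistic regression on the decision values.
platt = LogisticRegression().fit(scores.reshape(-1, 1), y_valid)
probs_platt = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic regression: monotonic, non-parametric mapping from score to probability.
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, y_valid)
probs_iso = iso.predict(scores)
```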

Follow the Regularized Proximal Leader (FTRL Proximal)

Google Paper – Ad Click Prediction

This was the algorithm that helped me improve the log loss the most. It was implemented by one of the competitors and made available in the competition forum. Developed by Google for the same task of estimating ad click probability based on user information, it learns a sparser model and ends up being more robust to outliers. In the paper the authors describe the implementation and its characteristics compared to other algorithms used for the same task.

It is basically a logistic regression with adjustments that reduce the memory needed to store the weights, together with an optimization method that forces the least significant weights to exactly zero. That is, in a problem with millions of variables like this one, it disregards those that are not really important and, in this way, automatically selects features.

With the clean data, slight tuning, and 3 passes over the data, this model reached a log loss of 0.3925.
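The full implementation is in the repository linked above; what follows is a stripped-down sketch of the per-coordinate FTRL-Proximal update for hashed binary features, just to show the mechanics described in the Google paper (the hyperparameter values here are placeholders):

```python
from math import sqrt, exp

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic loss on binary (hashed) features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0, n_buckets=2 ** 20):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * n_buckets   # "lazy" weight accumulators
        self.n = [0.0] * n_buckets   # sum of squared gradients per coordinate

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0               # L1 forces small weights to exactly zero
        s = 1.0 if z > 0 else -1.0
        return -(z - s * self.l1) / ((self.beta + sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, indices):
        # indices: hashed positions of the active (value 1) features.
        wx = sum(self._weight(i) for i in indices)
        return 1.0 / (1.0 + exp(-max(min(wx, 35.0), -35.0)))   # clipped sigmoid

    def update(self, indices, y):
        p = self.predict(indices)
        g = p - y                    # gradient of log loss w.r.t. the score (features are 1)
        for i in indices:
            sigma = (sqrt(self.n[i] + g * g) - sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g

model = FTRLProximal()
model.update([12, 34567, 890123], 1)          # hashed indices of one impression, label = click
print(model.predict([12, 34567, 890123]))
```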

Neural Networks

I tried to use neural networks both in Vowpal Wabbit and in my own implementation. In Vowpal Wabbit the neural network has only one hidden layer with sigmoid activation. It did not show an improvement.

I created, in Python, a neural network with ReLU activation units. This type of activation has been widely used in Deep Learning because it does not suffer from exploding or vanishing gradients, besides favoring sparse representations. In some cases the result is equivalent to, or better than, networks with stacked unsupervised layers.

I used only one hidden layer and, in this case, there was an improvement on the validation data, but it did not translate into an improvement in the official competition score. Maybe using more than one hidden layer, and getting into Deep Learning territory, would have helped, but I did not have time to test it.

The best neural network, in the original clean data, with 30 neurons in the hidden layer, and ReLu activation, reached the log loss of 0.3937.

Hashing trick

In a problem with as many independent variables as this one, storing weights for all of them in memory becomes an issue. Although this is a small dataset when we talk about big data, it already calls for the hashing trick.

This technique consists of hashing the feature values and assigning weights to hash buckets, instead of directly to each column. Because we only have binary variables here, it is fairly easy to use.

After the hashing is done, at each iteration of training we update the weights of the buckets. There is a risk of collision, but the higher the number of buckets, the lower the risk.

In practice there is no significant loss of performance for this reason, so this has become a widely used technique in problems involving high dimensionality. Both Vowpal Wabbit and Proximal FTRL use this technique.
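A sketch of the hashing trick for this kind of data: each “column=value” string is hashed into one of D buckets, and the model keeps one weight per bucket. Python's built-in `hash` is used here only for illustration; real implementations use a stable hash function such as MurmurHash:

```python
D = 2 ** 20                      # number of weight buckets

def hash_features(row, D=D):
    """row: dict of column name -> value. Returns the indices of the active buckets."""
    return [abs(hash(col + "=" + str(val))) % D for col, val in row.items()]

weights = [0.0] * D

def raw_score(indices):
    # Because all features are binary, the dot product is just a sum of bucket weights.
    return sum(weights[i] for i in indices)

row = {"site_id": "abc123", "device_type": "1", "hour": "14"}
print(hash_features(row))        # three bucket indices in [0, D)
```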

Ensembling

It is almost impossible to score well in a Kaggle competition without putting many models together in an ensemble. Basically, if you have models that have similar accuracy but different errors, there is a good chance that if you join their predictions, you will have better performance.

Models in partitions of the data based on independent variables

The first attempt at ensembling was to create individual models for certain partitions of the dataset. For example, create 24 datasets, each containing only the samples from a given hour, or create a dataset for each application ID, and train individual models on them. Then subdivide the test set in the same way and get the predictions from these individual models.

In some cases, if the partitions correspond to groups that genuinely differ from each other, performance can improve greatly.

After creating several models based on subgroups that seemed to be different between them, the final ensemble, taking the simple average of 8 models, reached a log loss of 0.3935.

Models based on random interactions of the variables

Another alternative to create ensembles is to use different variables for each model. After checking that some interactions improved the score using Proximal FTRL, I decided to create a script that tested the effect that an interaction would have on the ensemble score. Although interesting, one should be careful about overfitting.

For each candidate interaction, the script tested whether an ensemble between a model using it and models with other interactions improved performance. This attempt generated 5 linear models with different combinations of variables which, averaged together, reached a log loss of 0.3888.

Putting together the best models

In the end, I put together several models that had shown a good score and reached a log loss of 0.3878, which guaranteed me 42nd position, among the top 3% of solutions.

The difference between the log loss of this solution and the winner was 0.0087.

Other ideas and the winning solution

After the end of a competition it is common for the best-placed participants to share their models and the ideas that contributed to their good performance.

In addition to the methods described here, two approaches stood out: features that took temporal factors into account, such as the number of visits a user made to a site or application in the last hour or on the same day; and features based on the count of times a variable, or an interaction between variables, appeared. Knowing how to exploit the temporal characteristics of the data seems to have favored the winners.

Using Machine Learning to Discover Illegal Ads in Russian, While Not Knowing Russian

The site Avito.Ru, founded in 2007, is one of the 5 largest Russian classifieds websites. It is used by both individuals and businesses to negotiate new and used products. One of the biggest challenges is maintaining the quality of content in the face of growth and the volume of new ads. To resolve this, the site decided to seek a solution based on the available data.

This was a competition run on the Kaggle platform.

For readers with technical knowledge, the final solution code is available at this link (github).

Available Data

The available database contained the content of the ads: attributes of the advertised product, category and sub-category, title, and description. The target variable was binary: whether or not the ad would be blocked by moderation. About 4 million ads labeled by human moderators were made available for training, and 1.5 million for testing.

In addition, there was a variable indicating whether that ad had been blocked by an experienced moderator, which could make a difference with respect to the human error factor.

Evaluation Metric

The metric used to evaluate the solutions was AvgPrecision @ K (Average Precision at K). After assigning each ad a score indicating how likely it is to be breaking the rules, the ads should be ordered so that the most likely offenders rank higher. The system then considered the top K ads and compared them with the true labels to determine the model's hit rate.

Data Transformation

My solution involved only the title and description of the ads. First the basics: lowercase all words, remove stopwords, and stem. One detail is that the ads were in Russian, so I do not know exactly which stopwords have other meanings (in English, “can” is a stopword, but it is also used as a noun). Either way, these procedures improved the model's performance. Then I turned the documents into a numeric matrix in which each row was an ad and each column a word. For the element values I tested three variations:

– Binary: where the presence of the word in the ad was indicated with the number 1;
– Count: each value was the number of times the word appeared in the ad;
– Tf-Idf: each value was based on a formula that takes into account the frequency of the word in the ad, and also in relation to other ads. It assigns a higher value to rare words in the overall context, but which have a high frequency within the specific ad.

Among these alternatives, the one with the best performance was Tf-Idf. This technique is widely used for text classification and generally improves classification accuracy over the other options.
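A sketch of the three representations with scikit-learn; the Russian-specific preprocessing (stopword removal and stemming) is omitted and the example texts are placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

ads = ["ad text one", "another ad text", "text of a third ad"]   # placeholder documents

binary_matrix = CountVectorizer(binary=True).fit_transform(ads)  # 1 if the word is present
count_matrix = CountVectorizer().fit_transform(ads)              # raw word counts
tfidf_matrix = TfidfVectorizer().fit_transform(ads)              # tf-idf weights

# Each result is a sparse matrix with one row per ad and one column per word.
print(tfidf_matrix.shape)
```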

I made some attempts to clean the data further, such as removing the rarest words, but performance was negatively affected. What helped a little was removing numbers.

Solution

My focus was on creating a simple solution, so my first trial was a Logistic Regression on all the data. One factor to take into account was the distribution of positive and negative examples, which was far from 50/50. With this solution the score was 91.8%.

After testing a few options available in scikit-learn (the Python machine learning library I used), I found that the “modified_huber” loss function produced a more accurate model. This loss is more robust than log loss, applying a quadratic penalty to small errors and a linear one to large errors.

Another idea that helped a lot was to separate the ads by category. Each category had a different proportion of positive and negative examples (some with less than 1%, others with more than 30%). Applying the above algorithm to the data split this way reached a 97.3% score, an improvement of 10.6%.

For the final solution, I also trained a Naive Bayes model which, despite assuming that the variables are independent, performs well for text classification. By averaging the predictions of the two algorithms, I achieved a final score of 97.9%.
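A sketch of this final combination under the same assumptions as above: an SGD linear model with the “modified_huber” loss and a multinomial Naive Bayes, both trained on the tf-idf matrix, with their predicted probabilities averaged:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

train_texts = ["good ad", "spam ad buy now", "another good ad"]   # placeholder documents
train_labels = [0, 1, 0]                                          # 1 = blocked by moderation
test_texts = ["buy now"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# 'modified_huber' gives a linear model that still outputs probabilities.
sgd = SGDClassifier(loss="modified_huber", random_state=0).fit(X_train, train_labels)
nb = MultinomialNB().fit(X_train, train_labels)

# Simple average of the two models' probabilities for the "blocked" class.
blend = (sgd.predict_proba(X_test)[:, 1] + nb.predict_proba(X_test)[:, 1]) / 2.0
print(blend)
```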

Differences to the winning solution

Comparing my solution with the winning team's, there is a small difference of 0.8% in the score. But when we look at complexity, the winning solution used more than 30 models, between transformations and classification, for each example of the training set. In practice, as is normally the case in these competitions, it would not be worth implementing the winning solution. But this takes no credit away from the winners, nor the opportunity to learn from them.
