Machine Learning for Sales Forecasting Using Weather Data

Walmart is a company with thousands of stores in 27 countries. It is possible to find several articles on the technological mechanisms used to manage the logistics and distribution of its products. This is the second time they have run a competition on Kaggle with the intention of finding interview candidates for data scientist jobs.

A major advantage of this type of competition is that we get access to data from large companies, and can see what problems they are trying to solve with probabilistic models.

The objective of the competition was to create a model that could predict the sales volume of certain products in specific stores in the days before and after blizzards and storms. The example given in the task description was the sale of umbrellas, which intuitively should increase before a big storm.


Two files were made available to train the model: one of them contained information about the identification of the stores, products, and the nearest weather stations. The other contained weather data for each station.

In total, data were available on 111 products whose sales could be affected by weather conditions, 45 stores, and 20 weather stations. The goal was to predict how much of each product would be sold in each store 3 days before, 3 days after, and on the day of a weather event.

A weather event means that more than an inch of rain, or more than two inches of snow, was recorded on that day.

The combination of stores and products gave us about 4.6 million examples for training, and about 500,000 for testing. Each example referred to a day, store and product.

Evaluation Metric

The metric used was the Root Mean Squared Logarithmic Error (RMSLE). It is basically the RMSE applied to the log(x + 1) transformation of the predictions and true values. This means that errors in predictions that should be close to 0 are punished more severely than errors in predictions with higher values. For example, predicting 5 items when the answer is 0 carries a greater penalty than predicting 105 when the answer is 100.
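For reference, a minimal sketch of the metric in Python (the example numbers come from the paragraph above):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error: RMSE on log(x + 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Errors near zero are punished more severely:
print(rmsle([0], [5]))      # ≈ 1.79
print(rmsle([100], [105]))  # ≈ 0.05
```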

Transforming the Data

Since I only had 10 days to work on this competition, I decided to see how far it was possible to get with a model based only on the original variables and, perhaps, an ensemble.

One difference between this and other competitions was that, even though the data was organized, you were responsible for linking the weather variables to the product and store data. This makes perfect sense, because Walmart would not want a data scientist who does not know how to manipulate data.

For this I used Pandas. This Python library for data manipulation is one of the most widely used, and is strongly reminiscent of the data structures available in R.
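As an illustration of that linking step, a small sketch with pandas. The column names (store_nbr, station_nbr, item_nbr, date, tavg, preciptotal) are assumptions mirroring the competition files, not necessarily the exact schema:

```python
import pandas as pd

# Hypothetical miniature versions of the files described above.
sales = pd.DataFrame({"date": ["2012-01-01", "2012-01-01"],
                      "store_nbr": [1, 2],
                      "item_nbr": [5, 5],
                      "units": [10, 3]})
key = pd.DataFrame({"store_nbr": [1, 2], "station_nbr": [7, 8]})
weather = pd.DataFrame({"station_nbr": [7, 8],
                        "date": ["2012-01-01", "2012-01-01"],
                        "tavg": [40, 55],
                        "preciptotal": [1.2, 0.0]})

# Attach each store's station, then that station's weather on the sale date.
merged = (sales.merge(key, on="store_nbr")
               .merge(weather, on=["station_nbr", "date"]))
print(merged[["store_nbr", "item_nbr", "units", "tavg"]])
```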

At first I used all the variables as numerical, trained an XGBoost with light tuning, excluding the variables that coded special alerts, and used 10% of the dataset to determine the number of trees. As expected, the result was poor: about 0.1643 on the leaderboard (LB).

Binarizing variables

After testing the first model, I coded the categorical variables with one-hot encoding. That is, for each level of a variable, a column with a 0/1 indicator was created, marking whether that level was present in the example. Normally the number of columns should be the number of levels minus one, so that you do not have problems with collinearity. Since I planned to use models that are not sensitive to this problem, I did not bother deleting one of the columns.
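In pandas this encoding is a single call; a toy sketch (drop_first=True would delete one column per variable to avoid the collinearity mentioned above):

```python
import pandas as pd

df = pd.DataFrame({"station_nbr": [7, 8, 7], "tavg": [40, 55, 38]})

# One 0/1 column per level of the categorical variable.
encoded = pd.get_dummies(df, columns=["station_nbr"])
print(encoded.columns.tolist())  # ['tavg', 'station_nbr_7', 'station_nbr_8']
```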

After tuning using a subset with 10% of the data, and getting the RMSLE of 0.1216 in the validation, I sent the solution, and got the value of 0.1204 in the LB, a good improvement over the previous one.


A lot of the weather data was missing, so I decided to test a simple imputation method: replacing the NaNs with the mean of the column values. After tuning again, now with parameters for these new values, I obtained an RMSLE of 0.1140 in the 10% validation and 0.1095 on the LB.
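The imputation itself is a one-liner with pandas (toy data):

```python
import numpy as np
import pandas as pd

weather = pd.DataFrame({"tavg": [40.0, np.nan, 50.0],
                        "preciptotal": [np.nan, 0.2, 0.4]})

# Replace each NaN with the mean of its own column.
imputed = weather.fillna(weather.mean())
print(imputed)
```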

Time Variables

I did not explore the temporal dimension of the data very much. In one of the attempts I added the previous day's weather information, which reduced the error to 0.1083.

Data Subsets

One method that worked very well in my first competition, and that I always try, is to divide the training set into small subsets related to some variable and train a specific model for each one. In this case, I decided to create 19 models, one for each of the 19 weather stations present in the test set. The RMSLE on the LB was 0.1101, worse than a single model that treats the stations as variables over the entire dataset.

A serious problem with this approach is using the same model, with the same parameters, on different datasets. Knowing this, I decided to do a small tuning of the parameters for each dataset, which reduced the RMSLE on the LB to 0.1069.

Despite the small difference, it seems the division into individual models for each station captured some information that was not present in the model that considered them all together.


Of the models I tested, two stood out: Gradient Boosted Trees (XGBoost) and Random Forests.

Random Forest

I had used Random Forest for regression only once, in a job, but never with a large amount of data. After tuning the parameters, applying the model to the imputed data, it resulted in a RMSLE of 0.1166 on the LB.

XGBoost

XGBoost presented the best error of any individual model. In addition to adjusting parameters such as the depth of the trees using a subset of the data, it was necessary to adjust the number of trees and the learning rate. Usually a small learning rate and a lot of trees is the safe recipe for better performance, in exchange for longer training time.

Since XGBoost is sensitive to the seed of the RNG, I decided to make an ensemble of XGBs by just changing this value. This method had improved my score marginally in other competitions, but here its impact was greater, for the following reason: XGBoost has a feature that lets you hold out part of the data so that it can determine the number of trees that minimizes the error. I decided to use this feature, holding out 5% of the data. In addition to varying the seed of XGBoost itself, I varied the seed used to split the data, which made the models more diverse, which is essential for a good ensemble.
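The mechanics of this seed-averaging scheme can be sketched with scikit-learn's GradientBoostingRegressor standing in for XGBoost. Here random_state controls both the algorithm's randomness and the internal early-stopping split, so one seed plays both roles; this is an illustrative sketch on toy data, not the original code:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

preds = []
for seed in range(5):
    # 5% of the data is held out to pick the number of trees
    # (early stopping); the seed changes both the split and the model.
    model = GradientBoostingRegressor(
        n_estimators=500, learning_rate=0.05,
        validation_fraction=0.05, n_iter_no_change=10,
        random_state=seed)
    model.fit(X_train, y_train)
    preds.append(model.predict(X_test))

# Average the predictions of the differently-seeded models.
ensemble = np.mean(preds, axis=0)
```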

The most I got to train was 10 XGBoosts in this scheme. Although the single model was already very stable, the RMSLE of the ensemble was 0.1041, a reduction compared to the 0.1095 of the single model.

Final Solution

In the end, I put together all the solutions that I had submitted, and ended up getting a RMSLE of 0.1028, guaranteeing a position among the top 20%.

Possible Improvements

One day after the end of the competition I reviewed the variables of my XGBoost and found that the variables identifying the products (item_nbr) were not in binary format, and were considered by the model to be the most important. With the correct coding, I believe it would have been possible to reduce the error further and achieve a better final position, although trees are quite good at capturing patterns even with categoricals in this format.

Classifying 150 Thousand Products in Different Categories Using Machine Learning

The Otto Group is one of the largest e-commerce companies in the world. According to them, due to the diversity of the company's global infrastructure, many similar products are classified into incorrect categories. So they made available data on approximately 200 thousand products belonging to nine categories. The objective was to create a probabilistic model that correctly classified the products into their categories.


For training, about 60 thousand products were available and, for testing, about 150 thousand.

The 93 features were not identified; the only information given was that they were numerical variables. This makes it difficult to understand the best way to work with the data, and also complicates the work of creating new features.

In addition, the task involves multiple classes, that is, we have to generate probabilities for 9 classes.

Evaluation Metric

The metric chosen for this competition was the multi-class log loss.

This metric severely punishes confident probabilities assigned to the wrong class, and is sensitive to imbalance between classes. Some classes represented 20% of the training data, while others represented 9%. In this case, trying to balance the classes, whether with subsampling or by penalizing the smaller classes, will not help.
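A minimal sketch of the metric (toy probabilities for 3 classes; the competition had 9):

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Average negative log of the probability given to the true class."""
    probs = np.clip(np.asarray(probs), eps, 1 - eps)
    return -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))

probs = np.array([[0.98, 0.01, 0.01],   # confident and right
                  [0.98, 0.01, 0.01]])  # confident and wrong
# The single confident mistake dominates the average loss:
print(multiclass_log_loss([0, 1], probs))  # ≈ 2.31
```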

Feature Engineering

Even with anonymous variables, I decided to explore the possibility of creating interactions between them that might not be captured by the models.


I tried to create new features based on sums, differences, ratios, and products of the originals. The only feature that contributed significantly was the sum of all attributes for a given product. Most of the attributes were quite sparse (they had more zeros than other values), so maybe that contributed to the lack of success.
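The one feature that helped is a row-wise sum; a toy sketch (the feat_* column names are assumptions in the style of the anonymous variables):

```python
import pandas as pd

X = pd.DataFrame({"feat_1": [1, 0, 3],
                  "feat_2": [0, 0, 2],
                  "feat_3": [4, 1, 0]})

# Sum of all (mostly sparse) attributes for each product.
X["feat_sum"] = X.sum(axis=1)
print(X["feat_sum"].tolist())  # [5, 1, 5]
```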

Using a GBM / RF to Select Features

As we are talking about a gigantic space of possible attribute combinations, I decided to use a technique applied by a participant of another competition and published at this link: http://trevorstephens.com/post/98233123324/armchair-particle-physicist

Basically it consists of creating several datasets with the desired interactions and training a Gradient Boosted Trees model on each one. After that, we check which features are the most important in each dataset. The logic is: if a feature is important in several different datasets, it should be important overall.

I tried it, using 2 stratified splits, but although they agreed on the importance of some interactions, these did not improve my base model. As I had already spent days on this part, I decided to focus on fitting models for an ensemble.

If I had continued, perhaps selecting the X best interactions and re-tuning the hyperparameters of some model could have extracted value from these variables. Or caused terrible overfitting.


Gradient Boosted Trees (XGBoost)

My individual model with the lowest log loss was created with XGBoost, a fast, parallel implementation of the powerful Gradient Boosted Trees. This model has appeared in the winning solutions of many competitions and usually has superior performance.

To find the best hyperparameter combination I used a random search. I fixed the learning rate at 0.05 and varied hyperparameters such as the tree depth, the minimum number of examples that should compose a node, and the proportion of examples and variables that the algorithm should randomly select to build each tree. These hyperparameters control overfitting.

Usually the lower the learning rate, the better the accuracy, but more trees are needed, which increases training time. In the end, I set this value to 0.01 and trained 2000 trees.

After finding the best hyperparameters, the log loss with 3-fold cross-validation was 0.4656, and on the leaderboard, 0.4368.

Random Forests

Although I didn't have initial success with Random Forests in this competition, one suggestion given in the forum was to calibrate the predictions. Recently the scikit-learn development team made available a new tool that allows us to adjust the output values of a model so that they become closer to the real probabilities.

A Random Forest usually takes a vote among its trees, and the proportion of each class is given as the probability of the example belonging to that class. These proportions do not necessarily match the actual probabilities of the events, so we will only have true probabilities if we calibrate them.

As this competition asked us to predict the likelihood of a product belonging to a category, it was very important to have the probabilities adjusted correctly. In this case, I used the new scikit-learn tool with 5 data splits.
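With current scikit-learn, the calibration wrapper looks roughly like this (toy data; the calibration method shown is illustrative, not necessarily the one used in the competition):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_classes=3,
                           n_informative=6, random_state=0)

# Calibrate the forest's class proportions using 5 data splits.
base = RandomForestClassifier(n_estimators=100, random_state=0)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X, y)

probs = calibrated.predict_proba(X)  # rows now sum to 1
```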

This greatly improved the Random Forest's predictions, and although Random Forests also use decision trees, making them similar to the GBM, they ultimately contributed significantly to the ensemble.

Neural Networks

I had the opportunity to learn two great modules to train neural networks in a simple way in Python. They are Lasagne and NoLearn. With them, it is simple to create and train state-of-the-art neural networks, including using GPUs for processing.

But make no mistake: although they facilitate implementation, a neural network requires many decisions to become useful. We have to determine the architecture and the methods of weight initialization and regularization. Some researchers suggest doing a random search over the initial parameters and continuing to adjust manually.

In this case, the architecture that served me well was the following: 3 hidden layers with 768, 512, and 256 units, respectively. I trained two versions: one with a dropout of 0.5 between the hidden layers, and another with dropout of 0.5 only after the first hidden layer. The hidden units were ReLU, and the output layer, softmax.

An interesting interpretation from another competitor who arrived at a similar architecture is that the first hidden layer lets the network make random projections of the data, and the smaller layers that follow seek a synthesis of the information. Neural networks are naturally difficult to interpret, but I found this a very interesting point of view.

Finally, since the solution of a neural network is a local minimum, dependent on the initial weights (and we have many initial weights), I decided to average 5 neural networks with different initial weights. This average gave me a score of 0.4390 on the LB, comparable to XGBoost.


I also tried SVM, Nearest Neighbors, and Logistic Regression, but none of them performed well or contributed significantly to the ensemble.

Adjusting the Hyperparameters

Since we did not have access to the content of the variables, it was vital to tune the hyperparameters to get the best models. At first I did a random search with reasonable values that would still allow the models to explore the space a bit.

Usually the hyperparameters of a model are not independent, so adjusting one at a time will not find the best combination. It is therefore important to test combinations of parameters. Unfortunately, most of the time it is impossible to test all possible combinations, but with random search we can get an idea of the solution space without having to explore it exhaustively.

Once this exploration is done, it is good to manually vary some of them, and see if it is possible to improve performance on some neighboring combination.
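A sketch of such a random search with scikit-learn's RandomizedSearchCV (toy data, with GradientBoostingClassifier standing in for XGBoost; the parameter ranges are illustrative):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, random_state=0)

# Sample joint parameter combinations instead of tuning one at a time.
search = RandomizedSearchCV(
    GradientBoostingClassifier(learning_rate=0.05, n_estimators=50,
                               random_state=0),
    param_distributions={
        "max_depth": randint(2, 8),
        "min_samples_leaf": randint(1, 20),
        "subsample": uniform(0.5, 0.5),     # samples from [0.5, 1.0]
        "max_features": uniform(0.3, 0.7),  # samples from [0.3, 1.0]
    },
    n_iter=8, cv=3, scoring="neg_log_loss", random_state=0)
search.fit(X, y)
print(search.best_params_)
```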

Final Solution

In the end my solution was a weighted average of the following models:

– Gradient Boosted Trees (XGBoost) on the original variables and the sum of the columns.
– Random Forest on the original variables, log(X + 1) of them, and the sum of the columns.
– Neural network with 3 layers and dropout of 0.5 in each, on the original variables, log(X + 1) of them, and the sum of the columns.
– Neural network with 3 layers, dropout of 0.5 in the first hidden layer and none in the second, on the original variables, log(X + 1) of them, and the sum of the columns.

In addition to these models, I added a bias: simply predicting the proportion of each class as the probability of an example belonging to it.

This solution hit the log loss of 0.4080, and guaranteed me the 35th position among 3514 teams (Top 1%).

Winning solutions

The winning team’s solution consisted of a three-layered ensemble:

In the first layer 33 different models were trained, varying both the type of model and the variables used to train each one. Some were multiple models trained with bagging.

In the second layer, the predictions of these models were used to feed an XGBoost, a neural network, and an ExtraTrees with Adaboost.

In the third and last layer, they bagged the three models of the second layer (totaling 1100 models) and then took a weighted average of them.

In addition, new variables were created based on the distance of each example to the closest examples of each class, as well as others based on TF-IDF and t-SNE transformations of the original dataset.

Using Machine Learning to Identify Drivers From GPS Data

In recent years the insurance industry has been looking for ways to improve its models using machine learning. One of them is to use data that goes beyond a form completed by the insured to determine the risk of accidents.

One of the methods used is to collect driver behavior data through GPS tracking. It is believed that in this way it is possible to capture information and profile patterns that go beyond traditional methods.

What makes one driver different from another? Identifying the driver behind the wheel during a trip is the first step in building a good model.

Competition Description

The insurance company AXA decided to provide anonymized trips of approximately 2700 drivers. For each driver there are 200 trips, some of which are not from the driver in question, and the task was to create a model that would identify which trips were incorrectly attributed to that driver.

To avoid information leakage, the trips were centered at the origin and randomly rotated, and some portions were removed from the beginning and the end of each trip.


In total, 200 trips from each of 2736 drivers were made available. Within each driver's folder was a random number of trips falsely attributed to that driver. One important piece of information was the assurance that most trips truly belonged to the driver.

Each trip was described in a CSV file with numerical values for the x and y positions, representing the distance from the origin in meters, and each line corresponded to a displacement of 1 second.


The metric used for evaluation was the area under the ROC curve, abbreviated as AUC.


The features I used are based primarily on the first three derivatives of the displacement: speed, acceleration, and jerk. Derivatives beyond these hurt performance.
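With a 1-second sampling interval, these derivatives reduce to successive differences; a toy sketch with NumPy:

```python
import numpy as np

# One trip: x and y positions (meters), one sample per second.
x = np.array([0.0, 1.0, 3.0, 6.0, 10.0])
y = np.array([0.0, 0.5, 1.0, 1.5, 2.0])

# dt = 1 s, so each np.diff approximates the next derivative.
vx, vy = np.diff(x), np.diff(y)
speed = np.hypot(vx, vy)  # first derivative of displacement
accel = np.diff(speed)    # second derivative
jerk = np.diff(accel)     # third derivative
```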

Frequencies – FFT

I tried to extract frequencies from both the raw data and its derivatives. I did this with the FFT (Fast Fourier Transform) function available in NumPy. I divided the frequency range into 20 buckets, and the value of each feature was the sum of the frequencies in its respective bucket.

In this case I extracted the frequencies separately for each component (x and y). Perhaps obtaining the frequencies of speed and acceleration with the two components together would have been better, but I did not test this hypothesis.
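A sketch of the bucketing idea on a toy speed series (the exact scheme used in the competition may have differed):

```python
import numpy as np

speed = np.random.RandomState(0).rand(600)  # toy 10-minute speed series

# Magnitude spectrum of the positive frequencies.
spectrum = np.abs(np.fft.rfft(speed))

# Divide the frequency range into 20 buckets; each feature is the
# sum of the magnitudes falling in its bucket.
buckets = np.array_split(spectrum, 20)
features = np.array([b.sum() for b in buckets])
print(features.shape)  # (20,)
```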

The maximum AUC I achieved, using frequencies of the three derivatives, was 0.7058, with a logistic regression without parameter optimization.

Histograms

This is the transformation I found in all the academic papers I read on the subject. It consists of making a histogram of the attributes. In this case, we get a new parameter: the size of the intervals.

The best way I found was to divide the magnitude of the velocity into 50 intervals, and the acceleration into 10.

Two details improved performance: using the "density" argument of the NumPy function, which computes the probability density function, and normalizing the results so that their sum is 1.
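A sketch of this transformation on toy speeds:

```python
import numpy as np

speed = np.random.RandomState(0).rayleigh(10.0, size=1200)  # toy speeds

# 50 intervals for speed; density=True gives the probability density
# function, and the extra normalization makes the features sum to 1.
counts, _ = np.histogram(speed, bins=50, density=True)
features = counts / counts.sum()
```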

Percentiles

Another frequent suggestion was to use the value of attributes at certain percentiles. This helps you figure out how extreme the values are.

An example may make this clearer: in the case of velocity, to obtain the value at the 5th percentile, I sorted the values in increasing order and located the point in the list at the border of the first 5% of the sorted data.

If we have a list of 100 sorted velocity values with indices from 1 to 100, we get the value at position 5. The value at position 50, for example, would be the median.

This is important because it differentiates drivers who have a faster or slower profile.

I used the 5th to 95th percentiles, with intervals of 5.
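With NumPy this is a single call; a toy sketch of the 5th-to-95th range described above:

```python
import numpy as np

speed = np.random.RandomState(0).rayleigh(10.0, size=1000)  # toy speeds

# Speed values at the 5th, 10th, ..., 95th percentiles.
features = np.percentile(speed, np.arange(5, 100, 5))
print(len(features))  # 19
```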

Repeated Trips

One approach that was important in taking the best teams to the top of the leaderboard was creating an algorithm to find similar trips. It was remarkable that some stretches, although at a different angle with respect to the origin, were identical.

I created a simple algorithm based on the driver's speed during the trip but, despite finding some similar trips, it ended up hurting the score.

The algorithms that succeeded were more complex, and in some cases involved variable transformations, dimensionality reduction, and thresholds to determine similarity. Of all of them, two methods seemed essential:

– Ramer-Douglas-Peucker: because each trip had a different duration, this method was used to reduce the number of points.
– Dynamic Time Warping: this method was used to compare parts of trips and determine the similarity between them.


Since it was not possible to know for certain which trips belonged to each driver (there were trips marked as positive that were in fact negative), there was an opportunity to decide on the best way to sample the data.


At first, with unsupervised methods, I used only the 200 trips of each driver. One suggestion in the competition forum was to use a driver's 200 trips as positive examples and trips from other drivers as negative examples.

Overall this was a very good approach. Among the ways of doing it were:

– Randomly pick N drivers and treat all of their trips as negative.
– Randomly choose V trips from N different drivers to use as negative.

The approach that worked best for me was the second option: selecting a small number of trips from several different drivers.

Similar / different samples

Another alternative I tested was to select negative samples from drivers that were more similar to, or more different from, the positive samples, according to a measure such as Euclidean distance.

I created attributes such as the average speed, distance traveled, and duration of trips for each driver, and determined which drivers were the most similar to, and the most different from, the driver in question.

Although the approach using the "most different" drivers showed improvement during cross-validation, it did not improve the score on the test set. The reason, I believe, is that the validation data contains trips that are false but marked as true, whereas in the test set we have the actual truth.

Reinforcing predictions

I tried to reinforce the model predictions by training more than once, but now using the predictions from the previous model instead of the original dependent variable, as follows:

In the first pass I did the normal procedure, using the driver's trips as positive and others randomly selected from other drivers as negative, to obtain predictions.

In the next passes I used the classes given by the previous model as the dependent variable. This helped greatly in cross-validation, but showed only a very small improvement on Kaggle's validation data.

Models and Methods


At first I decided to cluster the trips according to similar characteristics. This idea was based on the assumption that, since most trips belonged to the driver, it would be possible to separate the trips into two clusters, and the one with fewer trips would be the "negative" one.

Another idea was to increase the number of clusters, since there were differences between the trips even when they belonged to the same driver.

Either way, these ideas proved useless, with an AUC close to 0.5.

Another unsupervised approach I tried was to calculate the cosine similarity between the average of the trips' attributes and the individual records. In this case, the "probability" of a trip belonging to the driver would be one minus this value, which also proved useless.

Logistic Regression

My most successful model was logistic regression. This surprised me, since the success described by several participants in the forum involved boosted trees.

I used about 1000 examples: 200 positive and 800 negative. An important detail was enabling the option "class_weight=auto". This option assigns a different penalty to the examples, according to the class distribution.

In this case, an error or a hit in the positive class was worth more than the same in the negative class. In this way, although we had more negative examples, the penalty for errors was balanced.

I trained one regression per driver, personalizing the regularization penalty (L1 or L2) and its coefficient. Each driver's model performed a search over values of this coefficient with both types of regularization, and in the end selected the regularization type and coefficient that gave the best AUC in cross-validation.
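A sketch of this per-driver scheme with scikit-learn (toy features; in current versions class_weight='balanced' replaces the old 'auto', and the liblinear solver supports both penalties):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(1000, 20)               # toy trip features for one driver
y = np.array([1] * 200 + [0] * 800)  # 200 positive, 800 negative

# Search over both penalty types and their coefficients, keeping the
# combination with the best cross-validated AUC.
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", solver="liblinear"),
    param_grid={"penalty": ["l1", "l2"],
                "C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc", cv=3)
search.fit(X, y)
print(search.best_params_)
```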

The best individual model reached the AUC of 0.86.

Random Forests / Boosted Trees

According to the forums, the most used model was Gradient Boosted Trees. Basically, it consists of training a sequence of decision trees, each one focusing on correcting the errors of the trees that came before it. The prediction for a specific example is a linear combination of the predictions of each tree.

In my case the attempts to use this model were not successful. Two reasons may have contributed:

– For lack of confidence in my validation, I did not pay much attention to optimizing its parameters to achieve a good AUC.

– The cross-validation process to find good parameters would take a long time, and preliminary tests had not yielded good results.

Anyway, I trained a Random Forests model with the default parameters to use in an ensemble. It achieved an AUC of 0.847.

Ensembles

Towards the end of the competition I decided to take the "easy" way to improve the score: ensembles.

For a part of the ensemble I trained the following models:

– Logistic regression with L1 regularization only
– Logistic regression with L2 regularization only
– Linear SVM
– Random Forests

Although the SVM did not output probabilities, I thought it would be a good contribution to the ensemble, which proved to be true.

The simple mean of the predictions of these models produced an AUC of 0.8825.

Finally, I joined these with everything else I had produced since the start of the competition, for a final AUC of 0.8915. This guaranteed me a final position among the 15% best solutions of the competition.

Methods of Winning Solutions

The winning solutions used 400 examples for each driver, selecting random trips from other drivers as negative examples. The most used model was Gradient Boosted Trees.

In addition, complex algorithms that found similar trips and assigned them directly to the driver squeezed out the points needed to rank among the top 10 solutions.

How to Create a Machine Learning Model for Big Data in 10 Minutes Using Python

Support Vector Machines are popular models in machine learning applications. Although based on simple principles, they have proven very powerful on many tasks and datasets. In this article I want to demonstrate how to implement an SVM capable of handling data that arrives in real time, without having to keep it all in memory.

Click here to access the code (svm_english.py) and dataset for this article. I recommend running it using Pypy.

Main ways to train a machine learning model

There are three popular ways to train a model: batch learning, mini-batch learning, and stochastic learning.

Batch learning: in the first mode, we store all the training data in an array and feed it to the algorithm, minimizing the loss function based on all the examples at once. This is not always possible due to the size of the dataset, in which case we have to resort to the other two modes.

Mini-batch learning: in this case, we select a number N of examples and divide the training set into blocks of this size, then train the model on one block at a time.

Stochastic learning: this is a variation of mini-batch with N = 1. We use only one example at a time to train the model. This is the preferred mode in big data solutions.

And we have two main divisions:

Offline: in this case we have all the data stored in a database, but it does not fit in memory, so we need to train on one example (or batch) at a time.

Online: in this case we receive the examples in real time, train on each one, and discard it, without the need to store them.

Besides not having to load all the data into memory to train the model, in some cases algorithms that train on one example at a time can be faster than those that use all the data at once.


This article will use the Skin Segmentation Data Set, which can be downloaded from the UCI repository. It refers to a classification task in which RGB values of random points are extracted from pictures of people's faces, and the task is to determine, from these values, whether the point corresponds to skin or to another part of the image.

According to the researchers, images of people of various ages, races, and genders were collected. A practical application of this task would be identifying images with content inappropriate for minors on the internet.

There are about 245,000 examples, and it’s a binary classification problem.

Training Algorithm

The algorithm used for training will be PEGASOS. The academic paper with the details can be found here.

In its simplest version, we minimize the primal objective of the SVM using one example at a time. As this method uses the subgradient, it does not guarantee a reduction of the error in every iteration. Still, there are convergence guarantees.

The only parameter to choose is the constant that controls the degree of regularization. Usually it is chosen using some validation method. As the purpose of this article is to demonstrate an SVM implementation that can easily be adapted to data that does not fit in memory, or that arrives in real time, I will simply set it to 0.1.

In an actual application, it may be interesting to adjust this parameter using a smaller sample of the data, offline, with cross-validation.

Evaluation of the result

In a real online environment, one way to evaluate the performance of the model would be to calculate the loss before predicting each new example, and average it over a time window. This method is called progressive validation.

In our case, we are simulating an online environment, and to make the evaluation more reliable, I will use a separate test set with 20% of the data.

In the repository for this article you will find the files train.csv and test.csv, already prepared. Each contains approximately 25% positive examples, so in addition to the loss, I want to evaluate the number of correct answers and errors in each class (true positives and true negatives).

The training set has 196,045 examples, and the test set, 49,012. We will use each example of the training set only once, and then evaluate the performance of the model in the test set.

Since we are not going to use this test set to tune hyperparameters, it is a legitimate estimate of the model's performance on new examples. If you do use it to validate parameters, you will need another test set.

Then we move on to the implementation of the algorithm to train the SVM. We create some helper functions: one to calculate the hinge loss, which is the loss we are minimizing, and another to determine the sign of the SVM output. The initial weights of the SVM can be 0, since we do not have problems with symmetry, as in neural networks.

Then we create a function to generate predictions. It simply takes the dot product between the weight vector and the attribute vector.

Now we come to the function that trains the model. We update the weights using the learning rate (alpha) and the gradient. This function trains on one example at a time. The learning rate decays, becoming smaller with each new example seen.

The gradient is different for the case where the current prediction is wrong (the example falls on the wrong side of the margin) and for the examples that are correct.

It would be possible to implement this function using fewer lines, but in this case I opted for clarity. To check this possibility, see the formula in the original publication of this algorithm.

Now we create the function to train and evaluate the model in the data. In the case of online learning, we would receive the data through a stream, so they would be seen only once.

For each example of the training set we invoke the function 'train', which adjusts the weights according to the learning rate. Because the method relies on the subgradient, it does not guarantee a reduction in the loss at every iteration.

We predict an example as positive if its distance to the hyperplane is positive, and negative otherwise. It is extremely rare for the distance to be exactly 0, but if it happens, we classify the example as negative.

In addition, due to the stochastic nature of the algorithm, we train it several times with the examples in different orders and average the performance. To make the results reproducible, we fix a seed for the RNG.
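The pieces described above can be sketched as follows. This is a Pegasos-style subgradient update, written for clarity; the function names and the 1/(lambda * t) learning-rate schedule are my own choices, not necessarily those of the original code:

```python
import numpy as np

def hinge_loss(y, score):
    # The loss we minimize: zero when the example is beyond the margin.
    return max(0.0, 1.0 - y * score)

def predict(w, x):
    # Raw SVM output: dot product between weights and attributes.
    return np.dot(w, x)

def train_one(w, x, y, t, lam=0.01):
    # One subgradient step; the learning rate shrinks as 1/(lam * t).
    alpha = 1.0 / (lam * t)
    if y * predict(w, x) < 1:      # example violates the margin
        grad = lam * w - y * x
    else:                          # correctly classified with margin
        grad = lam * w
    return w - alpha * grad

def run(X, Y, seed=0):
    # Single shuffled pass over the data, as in online learning.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    w = np.zeros(X.shape[1])       # zeros are fine: no symmetry issue
    for t, i in enumerate(idx, start=1):
        w = train_one(w, X[i], Y[i], t)
    return w
```

A prediction is then classified by the sign of `predict(w, x)`, with an exact zero treated as negative.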

Training Set Influence

Because it is an algorithm that considers one example at a time, the order in which they are presented changes the behavior, causing it to search for different paths to reach the minimum. All numbers reported in this paragraph are evaluated in the “test.csv” samples.

Training in our original training set, without shuffling the data, the performance is as follows:

Now, to have a real estimate of performance in practice, let’s test it 100 times by modifying the order of arrival of the examples, the average performance in this case is:

A detail of this dataset is that the class proportions are unbalanced (75/25). A simple remedy, if we have a lot of data, is to reduce the proportion of the larger class until the classes are balanced (50/50). In addition, some experts suggest that ensuring the next example seen by the algorithm is different from the previous one may improve performance. Let's try these two options by adding the following code:

It makes the algorithm always receive an example different from the previous one, with approximately equal class proportions. We do not touch the test set; it keeps the original ratio (75/25).
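The competition code is not reproduced here, but a minimal sketch of these two ideas (undersampling the majority class and alternating classes) could look like this; the function name and data layout are my own:

```python
import random

def balance_and_alternate(examples, seed=1):
    """Undersample the majority class to a 50/50 ratio and interleave the
    classes so consecutive examples always differ. `examples` is a list of
    (features, label) pairs with labels in {-1, +1}."""
    random.seed(seed)
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == -1]
    n = min(len(pos), len(neg))           # size of each balanced half
    pos, neg = random.sample(pos, n), random.sample(neg, n)
    out = []
    for p, q in zip(pos, neg):            # alternate +1, -1, +1, -1, ...
        out.extend([p, q])
    return out
```

The test set is left untouched, so the evaluation still reflects the original 75/25 ratio.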

In this case, using only the order of the original training set, the performance becomes:

And averaging 100 shuffled datasets:

We see a significant improvement in the number of true positive hits, despite a large increase in the loss. This increase is due to the fact that the loss is calculated based on all the examples. As we are “pulling” the hyperplane that separates the classes in order to accept more errors in the negative class, in exchange for a better performance in the positive, we end up increasing the distance of some incorrectly classified points.

An alternative would be to keep all the data and assign a proportionally greater loss to errors on minority-class examples.

In cases with imbalanced classes, such as this one, it is best to use metrics that are not sensitive to class frequencies; hence the choice to report true positives and true negatives.

Results and Characteristics

The primal SVM objective is a convex function, so in principle we can reach the global minimum. However, receiving examples in random order, training on each one only once, with a learning rate that decreases over time, can prevent us from reaching that point in a practical application. Even so, there are theoretical guarantees of convergence to the global minimum.

In Machine Learning we seek to minimize the generalization error, that is, outside the examples used for training. In most cases, the best generalization performance occurs before we reach the minimum in the training set.

Using Machine Learning To Predict Which User Will Click An Ad

Avazu is an advertising platform that delivers ads on websites and mobile applications. One of the biggest challenges in the industry is determining which ad is most relevant to the user. If your ad is in accordance with the user’s interests, the chance of click is higher, and this increases the profits of both the platform and the advertiser, which will have a potential customer visiting his site.

Machine Learning has been used by several companies that deliver ads in the pay-per-click format, that is, companies paid by the advertiser for each click on the ad. Among its applications is the creation of models to estimate the likelihood of a user clicking on an ad, based on:

Profile information: location, device used;
Advertiser information: ad category, size, and color;

And also data about the site or app in which the ad is displayed.

In this repository I make available the original code for the implementation of FTRL Proximal by Kaggle’s user tinrtgu, based on the paper developed by Google, as well as the modified code to obtain my best single model.

Data Description

In this competition Avazu made available a dataset with the log of 10 days of impressions and clicks, with information about users, advertisers, and display platforms. The goal was to predict the likelihood of a visitor, on an app or site, clicking on the displayed ad.

The day following the training period was held out as the test set to evaluate the model. The training set had about 40 million rows, and the test set about 4 million. Furthermore, it was a high-dimensional problem, with more than 1 million independent variables.

The metric chosen for evaluation was LogLoss, which heavily punishes inaccurate predictions that are very confident. We should keep in mind that there was an imbalance between the proportion of positive and negative examples, and this metric favors correctness in the most represented class (in this case, the negative ones).
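The metric takes a few lines to compute. As a sanity check, predicting a probability of 0.5 for every example yields a log loss of about 0.693 regardless of the labels:

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Average LogLoss; y in {0, 1}, p is the predicted click probability.
    Clipping avoids an infinite penalty at exactly 0 or 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

Note how a confident wrong prediction (say, 0.99 for a non-click) is punished far more than a cautious one.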

Baseline model

Predicting a probability of 50% for all examples produced a log loss of 0.6932. The first model I tested was a logistic regression with Vowpal Wabbit, without regularization and with only 1 pass over the data. It resulted in a log loss of 0.3993.

Cleaning the Data

Due to the large number of features and values, some of them didn’t occur frequently in the dataset, and this can hurt the classification. Logistic regression is sensitive to outliers in the features, so removing them could improve performance.

I tested two strategies: removing the outliers, or grouping them under a "rare" value. The choice that significantly improved the log loss was to group the values of features that were present in fewer than 5 examples.

The same logistic regression, without any optimization, applied to the “clean” dataset reached a log loss of 0.3954.
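A minimal sketch of this grouping step, assuming the examples are stored as dicts of categorical feature values (the data layout and threshold name are mine):

```python
from collections import Counter

def group_rare_values(rows, min_count=5, rare="rare"):
    """Replace feature values seen in fewer than `min_count` rows with a
    single 'rare' token. `rows` is a list of dicts {feature: value}."""
    counts = Counter((f, v) for row in rows for f, v in row.items())
    return [
        {f: (v if counts[(f, v)] >= min_count else rare)
         for f, v in row.items()}
        for row in rows
    ]
```

This collapses the long tail of one-off values into a single bucket, which the linear model can then treat as its own category.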

Validation Issues

In this competition one of the big problems was finding a suitable way to create a reliable validation set.

These data change rapidly: new variables come in, and old variables stop being used in a matter of days or weeks, so it is natural that it was hard to reach a consensus on the exact future performance of a model.

Some ways were discussed in the competition forum, and two of them were widely used:

Progressive Holdout

In this approach, after every N examples used to train the model, the log loss is calculated on example N + 1. After going through the entire dataset, the average of these scores is taken. In my attempts this proved to be a rather optimistic method.

Validation in the last days

Because this data has a time dependence, the model was trained excluding the last day, or the last two days, of the dataset, which were then used for validation. This alternative proved more reliable than the other, but it was not very robust to the choice of interactions between variables, so it was still necessary to watch for overfitting.

Feature Engineering

All features of this competition were binary, and some were anonymous, so I had no information about what they represented.

I tried counting how many times each value of a variable appeared in the dataset and using this count as a new feature. It did not work for me and actually worsened performance, although it worked for other teams.

An alternative that showed improvement, but nothing significant, was to create a variable indicating if a given example had more or less than 100 occurrences.

Interaction between variables

Another alternative was to create interactions between two or more variables. This was done simply by creating a variable that indicated the presence of combinations between values. At first I tried the interaction between all the pairs of variables, which worsened performance.

I made another attempt, this time manually creating combinations that seemed relevant to me (representing a user's access to an app, for example). These eventually improved the model, reducing the log loss to 0.3889.

Models Trained

Logistic Regression with Vowpal Wabbit

I took this opportunity to learn about a new tool: Vowpal Wabbit. A fast implementation of Machine Learning algorithms that minimize convex functions for both regression and classification. Just to keep in mind, the “pure” logistic regression, without data cleansing, had a log loss of 0.3993. After cleaning this number dropped to 0.3954.

Calibrated SVM with Vowpal Wabbit

Since it was necessary to send a list of probabilities to Kaggle, I tried to use the distance of the data to the SVM hyperplane as inputs, both for a logistic regression (inspired by Platt’s scaling) and for an isotonic regression, available in scikit-learn.

These are two popular ways to calibrate an SVM so that it produces probabilities. Neither showed a significant improvement in the score.
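A sketch of the two calibrations with scikit-learn. The data here is synthetic and the setup simplified; in practice the calibrator should be fit on held-out distances, not the SVM's own training data:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState(0)
X = rng.randn(400, 5)
y = (X[:, 0] + 0.5 * rng.randn(400) > 0).astype(int)

svm = LinearSVC(C=1.0, max_iter=10000).fit(X[:300], y[:300])
dist = svm.decision_function(X[:300]).reshape(-1, 1)  # distances to hyperplane

# Platt-style: fit a logistic regression on the distances.
platt = LogisticRegression().fit(dist, y[:300])
p_platt = platt.predict_proba(
    svm.decision_function(X[300:]).reshape(-1, 1))[:, 1]

# Isotonic: fit a monotone mapping from distance to probability.
iso = IsotonicRegression(out_of_bounds="clip").fit(dist.ravel(), y[:300])
p_iso = iso.predict(svm.decision_function(X[300:]))
```

Both map the raw margin distance into [0, 1]; isotonic regression is more flexible but needs more calibration data to avoid overfitting.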

Follow the Regularized Proximal Leader (FTRL Proximal)

Google Paper – Ad Click Prediction

This was the algorithm that helped me greatly improve the log loss. It was implemented by one of the competitors and made available in the competition forum. Developed by Google for the same task of calculating ad click probability based on user information, it creates a more sparse representation of the data, and ends up being more robust against outliers. In this paper the author describes the implementation and characteristics of it compared to other algorithms used for the same task.

It can be said that it is basically a logistic regression with adjustments that reduce the memory needed to store the weights, combined with an optimization method that forces the least significant weights to become exactly zero. That is, in a problem with millions of variables like this one, it disregards those that are not really important and, in this way, automatically selects features.
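The per-coordinate update can be sketched compactly. This follows the update rule in the Google paper (and the structure of tinrtgu's code), but the class and hyperparameter names are mine, and real implementations add feature hashing and stream the data:

```python
import math

class FTRLProximal:
    """Minimal per-coordinate FTRL-Proximal for logistic loss.
    The L1 term forces weak weights to exactly zero (sparsity)."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0, dim=2**20):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim   # accumulated adjusted gradients
        self.n = [0.0] * dim   # accumulated squared gradients

    def _weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0         # small coordinates stay at exactly zero
        sign = -1.0 if z < 0 else 1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, idx):
        """idx: indices of the active binary features of one example."""
        wtx = sum(self._weight(i) for i in idx)
        return 1.0 / (1.0 + math.exp(-max(min(wtx, 35.0), -35.0)))

    def update(self, idx, p, y):
        g = p - y              # gradient of logloss for each active feature
        for i in idx:
            sigma = (math.sqrt(self.n[i] + g * g)
                     - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self._weight(i)
            self.n[i] += g * g
```

Note that weights are never stored directly: they are recomputed lazily from `z` and `n`, which is what keeps the memory footprint and the model sparse.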

With the clean data, slight tuning, and 3 passes over the data, this model reached a log loss of 0.3925.

Neural Networks

I tried using neural networks both in Vowpal Wabbit and in my own implementation. In Vowpal Wabbit the neural network has only one hidden layer with sigmoid activation. It did not show an improvement.

I created, in Python, a neural network with ReLU activation units. This type of activation has been widely used in the area of Deep Learning, because it does not suffer from exploding or vanishing gradients, besides favoring sparse representations. In some cases the result is equivalent to, or better than, networks with stacked unsupervised layers.

I used only one hidden layer, and in this case there was an improvement on the validation data, but it did not translate into an improvement in the official competition score. Maybe using more than one hidden layer, and getting into Deep Learning territory, would have helped, but I did not have time to test it.

The best neural network, on the original clean data, with 30 neurons in the hidden layer and ReLU activation, reached a log loss of 0.3937.

Hashing trick

In a problem with many independent variables like this, storing weights for all of them in memory becomes an issue. Although this is a small dataset when we talk about big data, it already requires the use of the hashing trick.

This technique consists of hashing the feature values and assigning weights to the resulting buckets, instead of directly to each column. In this case, because we only have binary variables, it is fairly easy to use.

After the hashing is done, at each iteration of training we update the weights of the buckets. There is a risk of collision, but the higher the number of buckets, the lower the risk.

In practice there is no significant loss of performance for this reason, so this has become a widely used technique in problems involving high dimensionality. Both Vowpal Wabbit and Proximal FTRL use this technique.
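A minimal sketch of the trick. A stable hash (here MD5, purely for illustration; production code typically uses a faster hash like MurmurHash) maps each feature/value pair to one of D buckets:

```python
import hashlib

D = 2**20  # number of weight buckets

def hashed_index(feature, value, D=D):
    """Map a feature/value pair to one of D buckets using a stable hash.
    Collisions are possible, but become rarer as D grows."""
    h = hashlib.md5(f"{feature}={value}".encode()).hexdigest()
    return int(h, 16) % D

weights = [0.0] * D
# Active buckets for one example with two binary features:
idx = [hashed_index("site_id", "abc123"), hashed_index("device", "mobile")]
score = sum(weights[i] for i in idx)   # dot product over active buckets only
```

Training then only touches the buckets in `idx`, so memory is bounded by D regardless of how many distinct feature values the stream produces.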


Ensembling

It is almost impossible to score well in a Kaggle competition without combining many models into an ensemble. Basically, if you have models with similar accuracy but different errors, there is a good chance that joining their predictions will give better performance.
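The simplest combination, used repeatedly below, is just the average of the models' predicted probabilities; a one-function sketch:

```python
import numpy as np

def average_ensemble(prediction_lists):
    """Simple average of per-model predicted probabilities. Works best
    when the models are similarly accurate but make different errors."""
    return np.mean(np.asarray(prediction_lists, dtype=float), axis=0)
```

For example, averaging one model that predicts [0.9, 0.2] with another that predicts [0.7, 0.4] gives [0.8, 0.3].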

Models in partitions of the data based on independent variables

The first attempt at ensembling was to create individual models for certain partitions of the dataset. For example, create 24 datasets, each containing only the samples at a given time, or create a dataset for each application ID, and train individual models in them. Then subdivide the test set in the same way and get predictions from these individual models.

In some cases, if the partitions correspond to groups that genuinely differ from one another, performance can improve greatly.

After creating several models based on subgroups that seemed to be different between them, the final ensemble, taking the simple average of 8 models, reached a log loss of 0.3935.

Models based on random interactions of the variables

Another alternative to create ensembles is to use different variables for each model. After checking that some interactions improved the score using Proximal FTRL, I decided to create a script that tested the effect that an interaction would have on the ensemble score. Although interesting, one should be careful about overfitting.

Each model selected an interaction and tested whether the performance of an ensemble between it, and some model with other interactions, improved. This attempt generated 5 linear models with different combinations of variables, which together, taking the simple average, reached a log loss of 0.3888.

Putting together the best models

In the end, I put together several models that showed a good score, and reached a log loss of 0.3878, which secured 42nd place, among the top 3% of solutions.

The difference between the log loss of this solution and the winner was 0.0087.

Other ideas and the winning solution

After the end of a competition it is common for the best-placed participants to share the models and ideas that contributed to their good performance.

In addition to the methods described here, two approaches stood out: attributes that took temporal factors into account, such as the number of visits a user made to a site or application in the last hour, or on the same day; and attributes based on counts of how many times a variable, or an interaction of variables, appeared. Knowing how to use the temporal characteristics of the data seems to have favored the winners.

Using Machine Learning to Discover Illegal Ads in Russian, While Not Knowing Russian

The site Avito.Ru, founded in 2007, is one of the 5 largest Russian classifieds websites. It is used by both individuals and businesses to negotiate new and used products. One of the biggest challenges is maintaining the quality of content in the face of growth and the volume of new ads. To resolve this, the site decided to seek a solution based on the available data.

This was a competition run on the Kaggle platform.

For readers with technical knowledge, the final solution code is available at this link (github).

Available Data

The available database contained the content of the ads: attributes of the advertised product, category and sub-category, title, and description. The variable that needed to be predicted was binary: whether or not the ad would be blocked by moderation. About 4 million ads with labels given by human moderation were made available for training, and 1.5 million for testing.

In addition, there was a variable indicating whether the ad had been blocked by an experienced moderator, which could make a difference given the human error factor.

Evaluation Metric

The metric used to evaluate the best solution was Average Precision @ K. After assigning each ad a score indicating how likely it was to be breaking the rules, the ads had to be ordered so that the most likely came highest in the ranking. Once this was done, the system considered the top K ads and compared them with the true labels to compute the model's hit rate.

Data Transformation

My solution involved only the title and description of the ads. First the basics: lowercase all words, remove stopwords, and stem. One detail is that the ads were in Russian, so I do not know exactly which stopwords have other meanings (in English, "can" is a stopword, but it is also used as a noun). Either way, these procedures improved the model's performance. I then turned the documents into a numeric matrix in which each row was an ad and each column a word. For the element values I tested three variations:

– Binary: where the presence of the word in the ad was indicated with the number 1;
– Count: each value was the number of times the word appeared in the ad;
– Tf-Idf: each value was based on a formula that takes into account the frequency of the word in the ad, and also in relation to other ads. It assigns a higher value to rare words in the overall context, but which have a high frequency within the specific ad.

Among these alternatives, the one that demonstrated the best performance was the Tf-Idf. This is a technique widely used for text classification, and generally shows improvement in classification accuracy over the other options.
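The three weighting schemes are available in scikit-learn. The tiny example corpus below is made up, and in the real pipeline the preprocessing (lowercasing, stopword removal, stemming) happens before this step:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

ads = ["selling new phone", "new phone cheap", "old car for sale"]

binary = CountVectorizer(binary=True).fit_transform(ads)  # presence: 0/1
counts = CountVectorizer().fit_transform(ads)             # raw counts
tfidf = TfidfVectorizer().fit_transform(ads)              # tf-idf weights
```

All three produce a sparse ads-by-words matrix; only the cell values differ, and tf-idf was the variant that classified best here.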

I made some further attempts to clean the data, such as removing the rarest words, but they hurt performance. What helped a little was removing numbers.


My focus was to create a simple solution, so my first trial was a logistic regression on all the data. One of the factors to take into account was the distribution of positive and negative examples, which was far from 50% for each class. With this solution the accuracy was 91.8%.

After testing a few options available in scikit-learn (machine learning library in Python that I used), I found that using the “modified_huber” loss function created a more accurate model. This is a more robust function than log loss, since it applies quadratic penalty for small errors, and linear for large ones.
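A sketch of this choice in scikit-learn; the data below is synthetic, and the point is simply that, unlike hinge loss, the "modified_huber" loss lets `SGDClassifier` produce probability estimates:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(500, 20)
y = (X[:, 0] > 0).astype(int)   # toy label: sign of the first feature

clf = SGDClassifier(loss="modified_huber", random_state=0).fit(X, y)
proba = clf.predict_proba(X)[:, 1]   # available thanks to this loss
```

The quadratic-then-linear penalty also makes the fit less sensitive to extreme examples than log loss.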

Another idea that helped a lot was to separate the ads by category. Each category had a different proportion of positive and negative examples (some with less than 1% and others with more than 30%). Applying the above algorithm to the data split this way reached a 97.3% score. An improvement of 10.6%.

For the final solution, I also trained a Naive Bayes model which, despite assuming that the variables are independent, performs well for text classification. By averaging the predictions of the two algorithms, I achieved the final score of 97.9%.

Differences to the winning solution

Comparing my solution with that of the winning team, there is a small difference of 0.8% in the score. But when we look at complexity, the winning solution used more than 30 models, between transformations and classification, for each example of the training set. In practice, as is usually the case in these competitions, it would not be worth implementing the winning solution. But this takes no credit away from the winners, nor the opportunity to learn from them.

The Best Free Courses To Start Learning Machine Learning

Data Science is a new and very promising area. Several sources indicate that we need more data scientists than we can currently train. In this article I want to list courses you can take to start your journey in the field.

All of them are offered by renowned professors and universities, and can be taken for free over the internet. In addition, they have very strong practical content, which will equip you to start applying the techniques demonstrated to data of your interest.

The focus of these courses is on Machine Learning, which includes many of the most commonly used tools in data science.

Machine Learning – Coursera

This is the most popular introduction to machine learning online course. There are several types of data scientists, but much of the work requires knowledge of models that learn from data. In this course it is possible to learn superficially how several algorithms work, and how to use them to solve real problems.

The course begins by explaining what Machine Learning is, goes on to explain simple models, such as linear regression, and builds the foundation for more complex, widely used models such as Neural Networks and SVMs. In addition, the opportunity to implement parts of these algorithms helps us better understand how they work, and even though you will probably never have to implement an algorithm from scratch, it will help you to know its characteristics and limitations to use them in practical situations.

It begins with supervised learning, and at the end takes a brief tour of unsupervised learning topics and anomaly detection. In addition, there are videos with techniques and best practices to evaluate and optimize a model, avoid errors such as overfitting or underfitting, and adapt these algorithms to data that does not fit in memory (the famous big data).

Professor Andrew Ng is one of the founders of Coursera, as well as a professor at Stanford University and chief scientist at Baidu.

It is a fairly practical course, and requires only familiarity with linear algebra. Within the course there are videos that review concepts that will be important. The language used for programming assignments is Octave. There are videos within the course that teach the basics about the language, just what is needed to complete the course.

This course is offered on demand, meaning it can be taken at any time.

Statistical Learning – StanfordX

An introduction to Machine Learning from a statistical point of view. This course, offered by two highly respected statisticians, Rob Tibshirani and Trevor Hastie, is very practical and takes a closer look at the statistical concepts behind each model, without as much focus on the computational part.

It begins with an overview of the Statistical Learning area, explains classification and regression problems, and basic tools for linear modeling. Then it takes us through methods of model evaluation, and techniques to optimize the model taking into account the generalization for new data. Finally, more advanced algorithms such as SVM and Random Forests are presented, as well as a brief passage through unsupervised learning methods.

This course uses the R language, which is widely used in the area of statistics. There are programming assignments, but they are geared towards the use of models, not implementation.

This course is usually offered once a year in mid-January.

Learning from Data – EdX

This course offered by Professor Yasser Abu-Mostafa applies a more computational approach but, unlike Andrew Ng’s course, there is a good deal of theory, which helps to understand how the models work more fully.

In the first classes the professor explains the concept of machine learning, and the mathematics underlying the theory that guarantees we can use an algorithm to learn a task from data. The theory is presented from two points of view: VC Dimension and the Bias-Variance Tradeoff.

After that, it presents some models such as logistic regression, neural networks and SVMs. Some techniques are demonstrated to optimize the models so that they are useful in a practical application.

Finally, we are introduced to Kernels, which are very important variable transformations, mainly due to the success of the SVMs, and an overview of areas of study in Machine Learning that can be followed after the course.

Although there is no confirmation of a next session, it is worth mentioning that all the materials (videos and assignments) are available online. It does not require a specific programming language; assignments can be completed in any language.

Bonus – The Analytics Edge – EdX

This course, offered through the EdX platform by MIT's Professor Dimitris Bertsimas and his team, focuses heavily on applying machine learning methods using R. Much of the course is spent on examples of the techniques in use, and the assignments are extensive, so that students can explore all the commands taught. Every week there is a new case study.

In addition to teaching methods of supervised and unsupervised learning, in the end they talk about optimization methods that, in addition to being interesting in themselves, are the basis of many machine learning algorithms.

One point that sets this course apart, and which at the time of writing is not offered by the others, is that one of the assignments is to participate in a Kaggle competition open only to students of the course. This is an opportunity to use the tools in a realistic case, having to create a solution with the knowledge acquired in the course.

This competition component is extremely valuable. From my own experience, nothing teaches more than having a dataset in front of you and having to decide, on your own, the best direction to take the analysis.

How to choose a course?

If you ask me which one you should take, I'll answer: all of them. Although much of the material overlaps, each gives you a different view of Machine Learning.

With Andrew Ng you get a quick, practical, if superficial, presentation of the algorithms. In Statistical Learning, although very practical, there is a greater concern with classical statistical concepts, such as p-values and confidence intervals.

In Learning from Data, it is possible to understand the theory that underlies Machine Learning, the mathematical reason for an algorithm to be able to learn through the data.

If you are willing to do all, I recommend doing them in the order that they are arranged in the article.

How to Create a Simple Machine Learning Model to Predict Time Series

When we deal with time series prediction a widely used model is linear regression. Although simple, it has proved quite useful in real applications.

A very simple way to create a model for this case is to use past values of the variable of interest to predict its current value. It is also possible to create models that predict these series using other attributes, which in some cases will improve accuracy.

In this article I want to demonstrate the simplest way: a linear regression on the historical values of the variable we are trying to predict. The code is provided in a format suited to the reader understanding what is happening, so it may have parts that could be optimized in a production environment, but that was a deliberate choice, in line with the educational goal of the article.

The full code (prime_english.py) and the data are available here.

Data Description

The data used correspond to the prime rate in Brazil. The prime rate is the bank interest rate for preferred customers, applied to clients with low risk of default in high value transactions. We will use the values of this rate in the last 6 months to predict the next.

We have monthly data from January 2005 to November 2014. They are originally released by the Central Bank of Brazil, but were obtained on the Quandl platform.

Important note: Do not use the information in this article as the basis for making any decisions, including financial, or investment decisions. This is an educational article and serves only to demonstrate the use of a machine learning tool for time series forecasting.

Models Used as Benchmarks

To compare the performance of linear regression in this problem, I will use two other valid methods for forecasting time series:

Last month value: the forecast for the next month is just the value of the variable in the last month.

Moving Average: the forecast for the next month is the average of the last 6 months.

Evaluation Metrics

We will use the Mean Absolute Percentage Error (MAPE). It is a metric widely used in time series forecasting, and gives the average percentage error of the forecasts, disregarding direction (above or below the true value).

In addition to this error, we will also evaluate the Mean Absolute Error (MAE), which is the mean of the absolute values of the errors. This way we know how much we are deviating from the real values in the original units.

From a practical standpoint, in order to justify the use of linear regression instead of simpler methods, it should present an average error smaller than the error of the other options.

Python Modules

To run the script of this article you need the following Python modules: Numpy, Pandas and Scikit-learn. To reproduce the plot (optional), you need Matplotlib.

Defining the functions of evaluation metrics

For MAE we will use the implementation available in scikit-learn. Because it does not have a function to compute MAPE, we need to create one.
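A possible implementation of the two metrics (scikit-learn provides MAE out of the box; MAPE we write ourselves):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error  # MAE comes ready-made

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error: average of |error| / |true|, in %.
    Assumes no true value is zero, which holds for an interest rate."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
```

For example, predicting 110 when the truth is 100, and 180 when it is 200, gives a MAPE of exactly 10%.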

Loading and formatting the data

The data are in a CSV in a table format. After loading the CSV into memory using Pandas we need to organize the data to make the prediction. At the time of this writing, scikit-learn doesn’t get along well with Pandas, so in addition to preparing the data in the correct format to feed the model, we put them in numpy arrays.

The features matrix will have 6 columns, one for each month prior to the one we want to predict. The vector with the dependent variable will have the value to be predicted (next month).

For this we start with the seventh month available, which is number six in the loop because, in Python, the first element is indexed at zero.

Training the model and making predictions

Financial data often changes regime, so let's train a new model every month. The first month to be forecast will be the 31st available after the data transformation, so we have at least 30 examples to train the first model. In theory, the more data to train on, the better. For each month we will store the value predicted by each of the three methods, along with the true value.
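The whole walk-forward loop can be sketched as follows. The series here is synthetic (a straight line standing in for the prime rate), just to make the mechanics clear; the real script loads the CSV instead:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

series = np.linspace(10.0, 15.0, 120)   # stand-in for the monthly prime rate
WINDOW = 6

# Each row holds the previous 6 values; the target is the current value.
X = np.array([series[i - WINDOW:i] for i in range(WINDOW, len(series))])
y = series[WINDOW:]

preds, actuals = [], []
for t in range(30, len(X)):             # at least 30 examples to train on
    model = LinearRegression().fit(X[:t], y[:t])   # retrain every month
    preds.append(model.predict(X[t:t + 1])[0])
    actuals.append(y[t])

# The two benchmarks need no training at all:
last_value = X[30:, -1]                 # previous month's value
moving_avg = X[30:].mean(axis=1)        # average of the last 6 months
```

Each month's model sees only data available before that month, so the evaluation never leaks future information.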

Evaluation of results

After the forecasts are completed, we turn the lists into numpy arrays again and compute the metrics.

Mean Absolute Percentage Error
MAPE Linear Regression 1.87042556874
MAPE Last Value Benchmark 2.76774390378
MAPE Moving Average Benchmark 7.90386089172

Mean Absolute Error
MAE Linear Regression 0.284087187881
MAE Last Value Benchmark 0.427831325301
MAE Moving Average Benchmark 1.19851405622

We see that the best model is linear regression, followed by using the value of the last month, and a rather poor result with the moving average.

[Figure: Prime Rate Linear Regression]

Suggestions to improve the model

We have created the simplest machine learning model possible, based only on the historical values of the series. Below are some suggestions that can be implemented to possibly reduce the error.


Boosting

Boosting is a technique that trains several models on the same data, the difference being that each model is trained on the residual errors of the previous ones. Although there is a risk of overfitting if proper validation is not done, this technique is quite successful in practice.

Creating more features

We used only the values of the last six months as independent variables. If we add more variables that are correlated with the dependent variable, we can probably improve the model. In addition, we can add features based on transformations, such as squaring the values, to capture non-linear relationships.

Use other models

Linear regression is not the only option in these cases. Models like neural networks, SVMs, and decision trees can perform better.

Will Your Customer Pay You? Using Machine Learning to Predict Default

One of the most troubling issues for business owners, large or small, is customer default. Especially in a time of crisis, it is something business managers must handle well, or it can drive the company into bankruptcy.

Imagine being able to know which customers will stop paying just by observing their behavior and profile characteristics. With this information, a manager can adjust risk, take action, and focus efforts on the customers most likely to cause problems.

This is where machine learning comes in. In this article I want to walk through building a simple system to predict which customers of a credit card company will fail to pay next month's bill.

The data is originally available at: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Here is the link to the Jupyter Notebook with the code. It uses scikit-learn, numpy, and XGBoost.

The Dataset

This dataset, in XLS format, has 30,000 records and already contains several variables ready for machine learning. In a real case, part of the data scientist's job is to extract the variables that may be important for the prediction. Here that work has essentially been done for us.

We have profile data, such as age, sex, and education, as well as behavioral data, such as the payment records of recent months. Although profile data has some influence, it is usually the behavioral data, which can be extracted from transactional databases, that yields the most important variables.

How Will We Know if the Model Works?

Of course, in practice, what matters most is whether the model reduces the company's losses. In some cases it is possible to extract this information from historical data and use it as the model's evaluation metric, but with this dataset we do not have that option. Besides, it is important to evaluate a model from several different angles.

In this case, I chose the ROC AUC metric. I am assuming the company's goal is to identify the customers most likely to default, so it can take specific actions (such as a collection call or a letter) that increase the chance they will pay.

This metric, which ranges from 0.5 to 1, is higher when the customers who actually default receive a higher predicted score than those who paid on time. In other words, instead of worrying about getting the probability of payment exactly right, we only want defaulters to be ranked with higher scores than non-defaulters.

Beyond that, the decision of how to split the data into training and test sets is very important. Originally the split is by user: we train on one group of users and predict on a different group, all within the same time period.

In practice I like to take the temporal nature of the task into account, since that is closer to how the model will be used in production: set aside an earlier period for training and a later one for testing. This way we would learn, for example, how the model reacts to different economic scenarios and times of the year.

It would be possible to transform the data into chronological order and predict other variables, but in this article I will focus on the machine learning part, so I just want to leave this note about that detail.

I had to add 2 to the values of some categorical variables because scikit-learn's OneHotEncoder does not work with negative numbers, but this does not affect our predictions.

It is important to split the data into training and test sets before we start, and to touch the test set only once we have the final model. All of our validation will use the data originally set aside for training. This leaves a set of data untouched during model building that can give us a reliable estimate of performance.

A Simple Model as a Baseline

To establish a baseline, and to see whether more complex models bring any significant improvement, I will build one of the simplest models: logistic regression.

For that, I need to one-hot encode the categorical variables (such as sex, education, and marital status). This simply means that each value of the variable becomes a column that equals one if that value is present in the example, and zero otherwise.

Note that I set the "class_weight" argument to "balanced". Since we have fewer examples of the positive class, the model could give more importance to the negative class, but this argument makes the model penalize errors in proportion to each class's number of examples, making the classes equivalent in that sense.
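A minimal sketch of this baseline, assuming synthetic data and illustrative column positions rather than the actual credit card dataset:

```python
# One-hot encode the categorical columns, then fit a balanced logistic
# regression. Data and column indices are illustrative, not the real dataset.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# fake data: column 0 = sex, column 1 = education, column 2 = a numeric feature
X = np.column_stack([rng.integers(1, 3, 500),
                     rng.integers(1, 5, 500),
                     rng.normal(size=500)])
y = rng.integers(0, 2, 500)

pre = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), [0, 1])],
    remainder="passthrough")
model = make_pipeline(pre, LogisticRegression(class_weight="balanced",
                                              max_iter=1000))
model.fit(X, y)
scores = model.predict_proba(X)[:, 1]   # scores later ranked by AUC
```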

Logistic regression reaches an AUC of about 0.73.

Applying More Complex Models

Logistic regression is a very simple model, capable of capturing linear patterns. But what if there are non-linear interactions between our variables? One example would be single people under 25 carrying a different default risk. For a logistic regression to capture this pattern, we would need to create a specific variable.

Another way to capture these patterns is to use a non-linear model with greater capacity to capture complex patterns. In this case, I chose a model that generally delivers good performance without much tuning: the Random Forest.

A Random Forest is an ensemble of decision trees built on random subsets of the data and of the variables. Each tree is a weak model, but when we average their predictions, together they become a powerful one.

One advantage of trees is that, for categorical variables without high cardinality (many distinct values), they can approximate the patterns without one-hot encoding; we can use the ordinal format. So we will use the original format of these variables.
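A minimal sketch of this step, assuming synthetic ordinal data rather than the actual dataset:

```python
# Random Forest on ordinal-encoded categoricals, scored by out-of-fold AUC.
# Data is synthetic and illustrative, not the credit card dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.integers(0, 5, (500, 10)).astype(float)  # ordinal categories as-is
y = (X[:, 0] + rng.normal(0, 1, 500) > 2).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
# out-of-fold probabilities, so each example is scored by a model
# that did not see it during training
scores = cross_val_predict(rf, X, y, cv=3, method="predict_proba")[:, 1]
print(roc_auc_score(y, scores))
```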

With scikit-learn's default parameters, the Random Forest reached an AUC of 0.763. After tuning the parameters, I got to an AUC of 0.777. It seems like a small improvement, but across thousands of examples it makes a difference.

Can We Improve Further?

One of the most powerful tree ensemble models is Gradient Boosted Trees. Instead of building trees at random, this model builds trees that give more weight to the examples on which the current ensemble makes the most errors.

It is trickier to use than a Random Forest, since there are more parameters to tune. But when used well, it usually delivers better performance.

The implementation I used was XGBoost, which, along with LightGBM, is an open-source tool that runs in parallel, scales to large volumes of data, and delivers very good performance.

This model, with default parameters, reached an AUC of 0.775. After tuning the parameters, the AUC went to 0.781.

Checking the Results on the Test Data

Now that we have our final model, the Gradient Boosted Trees with the parameters that achieved the best result in our cross-validation, let's train it on all of our training data and see how it performs on the test data, which we have not used until now.

If the results are very different, it is a sign that we did something wrong while building the model, or that the patterns present in the training data are not as strong in the test data.

On the test set we get an AUC of 0.789, quite close to what we saw in cross-validation, which is a strong indicator that our model will work on new data, as long as that data is distributed the same way we split training and test.

No model is perfect, but what is most interesting is that, of the 100 test examples with the highest default scores, 83 actually missed the payment. Of the 100 examples with the lowest scores, only 3 defaulted.

Next Steps

In this article we started with a simple logistic regression model and worked up to the most complex one, which offered a significant improvement. These models could be put into production without major engineering effort.

To improve further, it would be interesting to have access to more users and to other databases from which we could extract more potentially important variables. In addition, if the available computational resources allow it, we can build several models and ensemble their predictions, achieving even better performance.

But this is a starting point.

How to Build a Product Recommendation System Using Machine Learning

Imagine being able to recommend different products to each user registered on your site, personalized to each customer's tastes. This is possible with automatic recommendation systems based on machine learning.

This is one of the most famous applications of machine learning in e-commerce. Who has never visited a store's website and seen "other products you may be interested in" on the page? Many companies already use this kind of system, including giants like Amazon and Netflix.

The methods described in this article can be applied to any product. Here I will demonstrate them with a database of users from a readers' community. The challenge is, based on ratings from 0 to 10 given to books, to recommend new books the user is likely to enjoy.

Data Format and Task

To build the recommendation system, we just need the data in the following format:

User – Product – Rating

That is, for each product a user has rated, we will have one row in our file. If your site does not have a product rating system, you can also replace the rating with a one, if the customer bought the product, and a zero otherwise.

We will try to predict the rating a user would give to a book they have not yet rated (and probably have not read). In practice, based on the predicted ratings for new books, we can recommend the highest-rated ones, since these are the books our model suggests will interest this reader.

I did some light cleaning of the data; both the data and the code can be found at this link: Github Materiais Recomendação
The original data came from this site: Book-Crossing Dataset


I will use the Surprise library, in Python, which provides recommendation algorithms we can train on our own data. It is not a very extensive library, but it has everything we need to build our system.

To evaluate the model I will use the library's native function to split the examples into 3 parts and run cross-validation.

The First Model

The first model we will build is very simple, based on the overall average rating across products, the offset of the product's average rating from the overall average, and the offset of the user's average rating from the overall average.

For example, imagine that the average of all ratings, across all products on your site, is 3. This is what we call the overall average.

Now imagine that the average rating of the book we want to recommend is 4. To get its offset, we subtract the overall average from it (4 − 3), so the offset of this product's average rating from the overall average is 1. That is, this book is rated better than the site's average book.

The last component of our formula involves the average rating the user gives to books. This accounts for how selective, or not, some users are. In our example, the user's average rating is 3, meaning the user rates books in line with the site average. We subtract the overall average from this value (3 − 3) and get 0 as the offset of the user's average rating from the overall average.

The formula we use to predict the rating this user will give this product is:

Rating = overall average + offset of the product's average rating from the overall average + offset of the user's average rating from the overall average

That is, in this case, 3 + 1 + 0 = 4.

In the code below I used the BaselineOnly model, which computes the coefficients for each user and product, along with the overall average, from our training data, and stores them so we can score new products.

To measure the error, I used the Root Mean Squared Error, which basically shows, on average, how much the predicted rating deviates from the actual rating.

The error for this model was 1.65347. That is a low error, considering the ratings go from 1 to 10.

Testing a More Complex Model

Now I will test a more advanced model. Instead of the three numbers used by the model above, this one will try to find more complex representations for each user and product. This gives the model more capacity to capture details, and the idea is that by capturing these details it can estimate the rating of a new product with a smaller error.

The error for this model was 1.74865. Although that is a low error, it is not better than our simpler model.

A more advanced or complex model or algorithm does not always mean an improvement. And in some cases, the improvement is so small that it is not worth it. That is why it is important for a data scientist to know what they are doing and to understand how the models work, so they can choose the best option for the dataset and the task at hand, instead of just applying whatever is most popular or most advanced.

I also tried tuning this model's parameters, hoping that might help it beat the simpler model, but without success. This strengthens the hypothesis that the simpler model is better in this case.


This was a very quick and simple demonstration of how to build a recommendation model that could be used on an e-commerce site. There are several other steps, such as more careful inspection of the data, decisions about the best way to structure the modeling process, and the optimization and testing of the models, that should be carried out to ensure the model performs well and is robust, but they are beyond the scope of this article.