Autor: Mario Filho

Mario is a machine learning expert and Kaggle Grandmaster, a five-time prize winner in machine learning competitions. Kaggle profile: https://www.kaggle.com/mariofilho.

Can a Machine Learning Model Predict the SP500 by Looking at Candlesticks?

Candlestick chart patterns are one of the most widely known techniques that claim to “predict” the market direction inside technical analysis circles.

The development of this technique goes back to 18th century Japan, and it’s attributed to a Japanese rice trader.

It consists of finding patterns in charts made of candlestick figures that summarize the open, high, low, and close prices over a period of time. There are many patterns, old and new, available on the internet. But the question is, do they really work as described?

Let’s use machine learning to find out.

There are multiple academic papers, books and websites testing these patterns in different instruments with statistical methods. Results are mixed.

If the patterns can really predict the market, training a machine learning model using them as features should make it able to predict the returns of a traded financial instrument.

Let’s dive into the experiment!


Getting the Data

I found this very nice module called pandas-finance that makes getting information from Yahoo Finance very easy and outputs it directly to a Pandas Dataframe.

Stock market price data is very noisy.

If we have a pattern that has an edge, it doesn’t mean that 100% of the time the price will go straight into the predicted direction. So let’s consider both returns one and three days after we see the candlestick pattern.

We have price data since 1990. We compute the returns over the Adjusted Close. The last 5 rows are removed, as we do not yet have return data for them.
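As a minimal sketch of this step, with synthetic prices standing in for the Yahoo Finance data (here I drop only the rows that lack a 3-day-ahead price):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Yahoo Finance data (column name assumed).
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {"Adj Close": 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 30)))},
    index=pd.date_range("1990-01-02", periods=30, freq="B"),
)

# Forward returns: price N days ahead relative to today's price.
prices["ret_1d"] = prices["Adj Close"].shift(-1) / prices["Adj Close"] - 1
prices["ret_3d"] = prices["Adj Close"].shift(-3) / prices["Adj Close"] - 1

# Drop the tail rows that have no future price to compute a return from.
prices = prices.iloc[:-3]
```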

Computing the Features

Here I use the wonderful python version of TA-LIB. It contains a lot of functions to compute technical analysis indicators. Included are functions to compute candlestick patterns.

All the candlestick functions begin with “CDL”, so we will find all the available functions, pass the price data, and then store the resulting signals in a DataFrame.

For each day we generally get -100, 100, or 0 as values, indicating a bearish signal, a bullish signal, or that the pattern is not present. Some patterns can return -200/200 when they check for a “confirmation”. Looking at the code, this doesn’t seem to leak information from the future.

Besides this, there are helper functions for us to compute and plot the error. The metric chosen is Root-Mean-Square Error (RMSE).

Let’s Put a Random Forest to Work

If there is a pattern in the data, a Random Forest can very likely find it. As it’s a model that doesn’t need much tuning to work, here I just set it to build 10,000 trees, which is a lot: enough to curb the noise and find any existing patterns.

Implementation from scikit-learn.

The training set consists of daily price data from 1990 to 2008. And the validation will be 2009 and forward. It’s important to have the validation split by time here, as we literally want to use this model in future, unseen years.
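A sketch of the split and model setup, with toy signals standing in for the real features (and far fewer trees than the article’s 10,000, to keep it fast):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins for the candlestick signals and 1-day returns, indexed by date.
idx = pd.date_range("1990-01-02", "2015-12-31", freq="B")
rng = np.random.default_rng(2)
signals = pd.DataFrame(rng.choice([-100, 0, 100], size=(len(idx), 5)), index=idx)
ret_1d = pd.Series(rng.normal(0, 0.01, len(idx)), index=idx)

# Time-based split: train up to 2008, validate from 2009 on.
train_mask = signals.index < "2009-01-01"
X_train, y_train = signals[train_mask], ret_1d[train_mask]
X_valid, y_valid = signals[~train_mask], ret_1d[~train_mask]

# The article uses 10,000 trees; 100 keeps this sketch quick.
model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_valid)
```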

The specific date (2009) was chosen so that we have a long period of time in validation to account for noise. We are talking about a technique that is more than 100 years old, so a decade should be fine.

As baselines we can take two sources:

  • predict every return as zero (base_zero)
  • predict the average return of training for each day in the test period (base_avg)
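The two baselines can be computed like this (toy return series; `base_zero` and `base_avg` follow the definitions above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy daily returns for training and validation.
rng = np.random.default_rng(3)
y_train = rng.normal(0.0005, 0.01, 4000)
y_valid = rng.normal(0.0005, 0.01, 2000)

# base_zero: predict every validation return as zero.
rmse_zero = np.sqrt(mean_squared_error(y_valid, np.zeros_like(y_valid)))

# base_avg: predict the training-set mean return for every validation day.
avg_pred = np.full_like(y_valid, y_train.mean())
rmse_avg = np.sqrt(mean_squared_error(y_valid, avg_pred))
```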

The full code can be found here.

If the model works, it will have a smaller error than these two approaches.


As the errors from the three solutions are very close, here we look at the differences from the zero prediction baseline. Negative values mean the approach does a better job than predicting everything as zero.

Returns One Day After

Returns Three Days After

The Random Forest was worse than both the average and the zero prediction. This means the model actually learned something from the data, but it didn’t generalize to the validation period.

Basically, the patterns that worked on the training period didn’t seem to keep working on future years.

But let’s give it another chance…

Opportunities Don’t Show Up Every Day

About 30% of our data consists of days without any patterns showing up. So an idea is: let’s only use this model when we have a pattern. This is fine, as in real life we would know if we had a pattern today or not, and could use the model accordingly.

One of the reasons to try this idea is that we can reduce some of the noise, and help the model identify the characteristics of the patterns better, leading to a predictive edge.

So here let’s select, both on training and validation, only the days when we see any pattern showing up.
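The filter itself is a one-liner in pandas (toy signal and return data for illustration):

```python
import numpy as np
import pandas as pd

# Toy signals: mostly zeros, occasional -100/100, plus aligned returns.
rng = np.random.default_rng(4)
signals = pd.DataFrame(
    rng.choice([-100, 0, 100], size=(1000, 10), p=[0.02, 0.96, 0.02])
)
returns = pd.Series(rng.normal(0, 0.01, 1000))

# Keep only days where at least one candlestick pattern fired.
has_pattern = (signals != 0).any(axis=1)
X_active, y_active = signals[has_pattern], returns[has_pattern]
```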

One day After

Three days After

Still bad. The model can’t beat a simple average. The model can be very powerful, but if the data doesn’t have signal, it will simply not work.

The Answer is…


Can we claim candlestick patterns don’t work at all? Again, no. We can say it’s very likely they don’t work consistently on S&P 500 daily data.

Before we reach a more general conclusion there are open avenues to explore:

  • What about other instruments? Can it work on low volume or low liquidity stocks?
  • What happens in the lower (or higher) time periods: seconds to months?
  • Would individual models for specific patterns work?

Just remember that we need to be careful: if we test enough ideas, some will work just by chance!

There is More to a Model than Predictions

The model predictions were bad, but we can look further into it to get more research ideas. Let’s see the feature importance for each pattern.

First, the native feature importance from Scikit-learn.
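A sketch of pulling the native importances out of a fitted forest (toy features named after a few real CDL patterns; the data is random, so the ranking here is meaningless):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy pattern signals and returns.
rng = np.random.default_rng(5)
X = pd.DataFrame(
    rng.choice([-100, 0, 100], size=(500, 4)),
    columns=["CDLDOJI", "CDLHAMMER", "CDLENGULFING", "CDLSPINNINGTOP"],
)
y = rng.normal(0, 0.01, 500)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, sorted so the "top patterns" are visible.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
```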

Some patterns are more important. This could mean at least two lines of work:

  • The model focused too much on patterns that did well on training, but stopped working (or were just a data fluke). In this case, we should discard the top features.
  • There are strong patterns in the data, that we can explore individually and may give an edge. In this case, more attention should go to the top.

A very good way to peek inside a model is using Shapley values.

A richer visualization changes the importance ranking slightly. Here we can see how the points with specific feature values contribute to the model output.

Higher SHAP value means it contributes to a positive return.

The Spinning Top candle appears as the most important feature. According to Wikipedia, it can be a signal of a trend reversion.

Let’s see how it relates to the model outputs.

In this plot, we see a weak suggestion that, when the bearish version of this pattern appears (close less than open price), the returns on the third day may be higher than today. And the contrary when the bullish version happens.

This may or may not be an interesting investigation line.

The most important part is that we saw how we can use machine learning to validate these patterns and suggest areas for future research.

Can Gradient Boosting Learn Simple Arithmetic?

During a technical meeting a few weeks ago, we had a discussion about feature interactions, and how far we have to go with them so that we can capture possible relationships with our targets.

Should we create (and select) arithmetic interactions between our features?

A few years ago I remember visiting a website that showed how different models approximated these simple operations. It went from linear models to a complex Random Forest.

One of the most powerful and most widely deployed complex models we have today is Gradient Boosted Decision Trees. We have LightGBM, XGBoost, CatBoost, scikit-learn’s GBM, etc. Can this model find these interactions by itself?

As a rule of thumb, that I heard from a fellow Kaggle Grandmaster years ago, GBMs can approximate these interactions, but if they are very strong, we should specifically add them as another column in our input matrix.

It’s time to upgrade this rule, do an experiment, and shed some light on how effective this class of models really is when dealing with arithmetic interactions.

What Do The Interactions Look Like?

Let’s use simple two-variable interactions for this experiment, and visually inspect how the model does. Here I want to try the four operations: addition, subtraction, multiplication and division.

This is what they look like on a -1 to 1 grid.
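A sketch of generating these four target surfaces (note this particular grid happens to contain no exact zero, so the division target stays finite):

```python
import numpy as np

# The four target surfaces on a [-1, 1] x [-1, 1] grid.
g = np.linspace(-1, 1, 50)  # 50 points: no exact zero in this grid
x1, x2 = np.meshgrid(g, g)

targets = {
    "addition": x1 + x2,
    "subtraction": x1 - x2,
    "multiplication": x1 * x2,
    "division": x1 / x2,  # finite because g skips zero
}
```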


As expected, Addition and Subtraction are linear. Multiplication and Division have non-linear parts. These will serve as a reference.

Bring on XGBoost

Because of its popularity and a mechanism close to the original GBM formulation, I chose XGBoost. The results should not differ much with other implementations.

I generated a dataset with 10,000 points covering the grid we plotted above. As we are studying the behavior of the model here, I didn’t care about a validation split. After all, if the model can’t capture the pattern in training, the test is doomed anyway.

Zero Noise

Let’s start with an “easy” scenario. Zero noise, Y is exactly the operation applied on X1 and X2. E.g.: X1 + X2 = Y.

Addition and Subtraction have some wiggles, because we are approximating the interaction, not perfectly modeling it, but it seems fine. I would say the model did a good job.

Now… what the heck is going on with Multiplication and Division?

The model was not able to capture the pattern. Predictions are constant. I tried the following, without success:

  • Changing the range
  • Multiplying Y by a big constant
  • Using XGBRegressor implementation vs xgb.train
  • Changing hyperparameters

We should find out why this happens. If anyone knows, please comment. I opened an issue on GitHub.

ANSWER: Jiaming Yuan and Vadim Khotilovich from the XGB team investigated the issue, and Vadim wrote: “it’s not a bug. But it’s a nice example of data with “perfect symmetry” with unstable balance. With such perfect dataset, when the algorithm is looking for a split, say in variable X1, the sums of residuals at each x1 location of it WRT X2 are always zero, thus it cannot find any split and is only able to approximate the total average. Any random disturbance, e.g., some noise or subsample=0.8 would help to kick it off the equilibrium and to start learning.”
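Vadim’s explanation can be verified numerically: on the noiseless symmetric grid with y = x1 * x2, the target sums to zero on both sides of any split on x1, so a squared-error split criterion sees zero gain everywhere:

```python
import numpy as np

# Symmetric grid, y = x1 * x2 with no noise.
g = np.linspace(-1, 1, 100)
x1, x2 = np.meshgrid(g, g)
x1, x2 = x1.ravel(), x2.ravel()
y = x1 * x2

# For any split threshold on x1, the residual sums on both sides are zero,
# so the algorithm can only approximate the overall average.
for t in (-0.5, 0.0, 0.5):
    left, right = y[x1 <= t], y[x1 > t]
    assert abs(left.sum()) < 1e-9 and abs(right.sum()) < 1e-9
```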

Thank you for the answer. Now we can keep this in mind if the model is acting stubbornly.

Is There Life Without Noise?

It’s very, very hard to find a real dataset without noise (whether we know about it or not). So let’s see how the model performs when we add a bit of random normal noise.

Gaussian Noise: Mean 0, StDev 0.5

With this amount of noise we see that the model can approximate the interactions, even division and multiplication. Back to our topic, although wiggly, we see that the model indeed can capture such patterns.

Gaussian Noise: Mean 0, StDev 1

Even more wiggly! But again, it can approximate the interactions.

Updated Rule of Thumb

So, what does all this mean for our day-to-day modeling? Can GBMs do simple arithmetic? The answer is YES.

The original comment still applies. If you want to really capture all the performance, and you know the interactions really apply, you should EXPLICITLY create them as features. This way the model doesn’t have to struggle so much to find the splits and thread through the noise.

Now, if you really don’t want to create the interactions, here are some ideas that may help the model approximate them better:

  • Adding more examples
  • Tuning the hyperparameters
  • Using a model better suited to the known form of the interaction

How To Use Neural Networks to Forecast Multiple Steps of a Time Series

Time series are wonderful. I love them. They are everywhere. And we even get to brag about being able to predict the future! Maybe I should put it on my Tinder profile?

As a follow-up to the article on predicting multiple time-series, I receive lots of messages asking about prediction for more than a single step.

A step can be any period of time: a day, a week, a minute, a year… So this is called multi-step forecasting.

I want to show you how to do it with neural networks.

This article is a simple tutorial on how to set up a basic code/model structure to get you up and running. Feel free to add improvement suggestions in the comments!

The data we will use is the same sales data, but now we will try to predict 3 weeks in advance. Each week can be considered a “step”.

I will use steps and weeks interchangeably in the article to get you used to the idea.

Keep in mind that this method can be used to predict more steps. I chose 3 only because it’s a tutorial. In a real case you should check with stakeholders what is the number of steps most useful to them.

You will need scikit-learn, Keras, pandas and numpy. Get the dataset here.

Let’s load the data

We have data for 811 products and 52 weeks. To make the processing easier, let’s create a column with only the numeric part of the product code.


We see that, although we have 811 products, the maximum product code is 819. We need to remember this when setting up the one-hot encoding for this information.

Preparing Inputs and Outputs

Here we are going to use the previous 7 weeks to predict the next 3. So let’s say we are in the last day of January and want to predict the first 3 weeks of February. We get the previous 7 weeks spanning January and part of December.

These numbers are, again, taken out of thin air to demonstrate the method. You could use 6 weeks, 12 weeks… whatever number makes sense. Oh, and it doesn’t need to be weeks. Steps can be any unit of time.

We need to transform our data. Every example in the training set (and, here, in our test set) needs to have as inputs the previous 7 weeks of sales values, and as outputs, the next 3 weeks.

Think about every row having 800+ zeros, with a one in the position corresponding to the respective product code, and 7 other values, being the previous steps values.

For completeness: these “previous step values” are called lags, because they form a lagged version of the time series.
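The lag/step transformation can be sketched for a single product like this (the helper name is mine, not from the article; in the full setup each row would also carry the 800+ product one-hot columns):

```python
import numpy as np

def make_windows(series, n_lags=7, n_steps=3):
    """Slide over a 1-D series, producing (7 lags in, 3 steps out) pairs."""
    X, y = [], []
    for i in range(n_lags, len(series) - n_steps + 1):
        X.append(series[i - n_lags:i])   # previous 7 weeks
        y.append(series[i:i + n_steps])  # next 3 weeks
    return np.array(X), np.array(y)

sales = np.arange(1, 53, dtype=float)  # 52 weeks of toy sales
X, y = make_windows(sales)
```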

But, Mario, what do I do in production, when I don’t have the next 3 weeks? You will have only your 7 previous steps and the model will generate a prediction for the next 3. The only difference is that you will need to wait 3 weeks to know how well your model predicted.

Here we use historical data to simulate this scenario.

Setting up Training and Test

As in any supervised learning task, we need a training and a test (validation) set. Here I am doing a simple split, taking about half of the weeks for training and the other half for testing, so that we have about the same number of data points in each.

For each product, we separate FUTURE steps for testing and PAST steps for training. If we did a simple random split we would mix past and future, and it would not be a reliable simulation of what we will find in production. After all, this is not a tutorial about creating time machines!

Preprocessing and Modeling

Neural networks are very sensitive to the scale of the data, so I decided to use the RobustScaler from scikit-learn, as we have 800+ one-hot columns. It doesn’t mean it’s the best option, feel free to try more sophisticated strategies.

Keras is a fantastic library for rapid development of neural networks. I chose a simple single-hidden-layer network, but this could be a recurrent net like an LSTM, or a convolutional network, which has been showing good results in time-series problems across papers. Or even both! Try different architectures and share the results with us.
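The article uses Keras; as a library-agnostic sketch of the same single-hidden-layer, three-output idea, scikit-learn’s MLPRegressor works too (toy data, illustrative hyperparameters):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy data: 7 lagged weeks in, 3 future weeks out.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 7))
y = X[:, -3:] + rng.normal(0, 0.1, (200, 3))

# One hidden layer; MLPRegressor handles multi-output regression natively.
net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
net.fit(X, y)
preds = net.predict(X)
```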

Now we just fit the model and get the results. The model is clearly underfitting (training error is higher than validation).


To have a better idea of how well the model is predicting, let’s calculate the root mean squared error over the log of the values. This is a smooth approximation of the mean absolute percentage error.
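A sketch of the metric (using log1p so zero sales don’t blow up; this assumes non-negative values):

```python
import numpy as np

def rmse_log(y_true, y_pred):
    # RMSE over log1p-transformed values: a smooth approximation of MAPE.
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([11.0, 18.0, 33.0])
err = rmse_log(y_true, y_pred)
```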

We get 29-30% for each of the steps. This is actually good, given that 90% of our output values are less than 30.

Final Comments

Does it mean we beat the solution from the other article? We can’t really tell, as we have a different test set, but if anyone wants to check, comment below!

There are many improvements: feature engineering, scaling, neural network architecture, hyperparameter tuning, even ensembling. But this should give you an idea on how to model this type of problem.

This is not the only method for multiple step forecasting, and it will not be the best for all problems. Keep it as another tool in your set.

Get a notebook with the full code here.

How To Predict Multiple Time Series With Scikit-Learn (With a Sales Forecasting Example)

You got a lot of time series and want to predict the next step (or steps). What should you do now? Train a model for each series? Is there a way to fit a model for all the series together? Which is better?

I have seen many data scientists think about approaching this problem by creating a single model for each product. Although this is one of the possible solutions, it’s not likely to be the best.

Here I will demonstrate how to train a single model to predict multiple time series at the same time. This technique usually creates powerful models that help teams win machine learning competitions and can be used in your project.

How to Win the Machine Learning Competition of Australia’s Largest Telecommunications Company in Just 19 Days

Telstra, Australia’s largest telecommunications and media company, sponsored a contest at Kaggle seeking data scientists that would stand out as part of a recruiting process to join their Big Data team.

Service log data from their network were made available, and the task was to create a model that could predict the severity of a failure at a particular location and time, using the data from these reports.

In total, 974 competitors from around the world participated, and this victory put me in 12th place (of 50,000+ users) in the global Kaggle ranking.

About the Data

There were about 7,400 training examples and 11,200 for testing. This is a small amount of data, which requires special attention during the creation and validation of the model.

Each example had an ID, which represented a pair between event location and time. In addition, other files were provided that contained, for each example: information on signals obtained by the current severity monitoring system when processing the logs, type of event, and severity of messages sent by it at that time.

The severity of the problem had three categories: 0, meaning that there were no problems; 1, meaning that some problems were detected in the network; and 2, representing many reported problems. Therefore, it was a multiclass classification task.


Although there is a temporal and spatial dependence between the data, since they are from different locations, at different points in time, the data was randomly split by the organizers. It was possible to know the location, but not the time when the record was obtained.

So a simple cross-validation was the best environment I found to test my model. Cross-validation is simply the procedure of dividing your data into K parts and running K training cycles, each time using K − 1 parts to train and evaluating on the part left out.

In this case I used 5 divisions, which provided a fairly reliable estimate of the performance on the unseen test data revealed after the end of the competition.
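A sketch of that setup with scikit-learn (toy data; the model and scoring choice here are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Toy features and three severity classes, as in the competition.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 10))
y = rng.integers(0, 3, 300)

# 5-fold cross-validation: each fold trains on 4/5 and evaluates on 1/5.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=cv, scoring="neg_log_loss",
)
```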

Feature Engineering

In total I had 242 variables.

“Magic” Variable

There was a pattern in the data, later called the “magic variable” in the forums, that allowed us to exploit what seemed to be a mistake in data preparation and achieve a good improvement in model accuracy.

Although the training and test files have random IDs, the ordering of the records of the other files had predictive value. With this, you could use the ID of these files themselves, or create variables related to it, to exploit the leak.

Some people criticized the fact that the best solutions used this error to get a better score in the competition, but I think it’s important to be able to find this type of error in data. The only thing that changes outside a competition is that the error should be corrected, rather than exploited.

Therefore, it is always important to look for errors in the data. There are several types of information leaks, some very subtle, that can make your model look better than it actually is.


There was a variable indicating the location of each record. In theory this is a categorical variable, which means it has no natural order, but I still used it as an ordinal instead of creating a column for each value, for two reasons: trees can capture the patterns of categorical variables even in this format (in some cases one-hot is better), and there could be some proximity pattern between sites.

For example, it may be that location 201 and 202 were under the same weather conditions, which could influence the occurrence of problems in these two stations, so perhaps there was an implicit cluster there.
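One simple way to get such an ordinal encoding, assuming the location arrives as a string with a numeric suffix (an assumption about the data format, for illustration):

```python
import pandas as pd

# Toy location strings; the numeric suffix becomes the ordinal code, so
# nearby stations (201, 202) stay close and any implicit cluster survives.
df = pd.DataFrame({"location": ["location 201", "location 202", "location 15"]})
df["location_ord"] = df["location"].str.extract(r"(\d+)", expand=False).astype(int)
```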

Log Variables

The service logs had what seemed to be a line for each warning signal, and a value indicating how many times the warning was triggered at that time. The organizers have not made it clear whether this was the meaning, but it is the most plausible explanation.

With this, I created a column for each of these signals, and the value for each example was the number of warnings.

These columns were quite sparse (more zeros than other values), so most of them ended up being useless, but some had predictive value.
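The reshaping from one-row-per-signal logs into one-column-per-signal features can be sketched with a toy log file (column names are mine):

```python
import pandas as pd

# Toy log file: one row per (example id, warning signal) with a count.
logs = pd.DataFrame({
    "id": [1, 1, 2, 3, 3, 3],
    "signal": ["s1", "s3", "s1", "s2", "s3", "s1"],
    "count": [2, 1, 5, 1, 1, 4],
})

# One column per signal, one row per example; absent pairs become 0 (sparse).
features = logs.pivot_table(index="id", columns="signal", values="count",
                            aggfunc="sum", fill_value=0)
```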

Final Solution

After being satisfied with the performance of my best individual model, I decided to create an ensemble. That is, create several models that capture different patterns in the data and complement each other.

Among the 15 models that composed the final solution, two stand out: Gradient Boosted Trees (in the implementation of XGBoost) and Neural Networks (using Keras).

Gradient Boosted Trees – XGBoost

Anyone who is used to Kaggle knows that most of the competition solutions involve Boosted Trees. This model is quite powerful by itself, but the parallel implementation in the XGBoost package has proven to be very good at solving machine learning tasks in general.

In this case, my best “individual” model was one of these and would have finished in 18th place. There was still a lot that could be improved on it, but since I had only a week to go, I knew that without an ensemble it would be a lot harder to win, so I decided to end the single model optimization at this point.

Neural Networks

A model that finds good solutions, but using a different strategy than decision trees, is a neural network. Although it did not demonstrate a performance comparable to XGBoost, it was a great addition to it in the ensemble.

I used the Keras library, which is fantastic for creating neural networks in Python.

I tried to normalize the data with all the transformers of scikit-learn, but the best was StandardScaler, which subtracts the mean and divides by the standard deviation. In addition, I used two layers of neurons and the Adam optimizer. To find a good architecture, dropout coefficient, and parameters in general, I ran a random search.

Neural networks seem pretty sensitive to “everything,” so I believe there are different combinations of architecture, data normalization, and variable encoding that would give equal or better results.

One of the reasons it did not get better was because I did not create a specific set of features for the neural network. I used the same set of variables from the trees. If this was my main model, I would have done it differently, but since it was just a component of the ensemble, this way was enough.


This was a brief summary of 3 weeks of work in this competition.

My initial plan was to create just one model and see how far I could go, but seeing that the model was good, I decided to try to reach my first individual victory.

Using Machine Learning to Predict the First Destination of 60,000 AirBnB Users

AirBnB is a technology company that offers a virtual environment where users can book accommodations to stay in various places in the world, and also advertise their properties to the travelers’ community.

Trying to find candidates to compose their team of data scientists, they decided to sponsor a competition in which the goal was to predict the first country in which a new user would make a reservation.

For me, the most interesting part of this competition was that it was possible to reach a good position (Top 7%) with only one model, which could be used in production right away.

About the Data

Anonymous data about the user profile, such as age, gender, account creation date, language, and the channel used to reach the site were made available. We had data since 2010.

In addition, there was data about user sessions, identified by the user id, describing the action performed (a click, a message to the owner of the property, a view of search results, etc.) and how many seconds passed between that action and the previous one. This data existed only from January 2014 on.

There were two other files, with data on the characteristics of the countries, but I found no use for them.

The metric used was NDCG. It basically goes from 0 to 1 and measures the relevance of the model results sorted by their predicted score rank. More information at this link.

A user could choose between 12 countries, but almost 90% of them went to the United States or did not make a reservation.

It soon became clear, due to the metrics used, that it would be better to focus on modeling whether or not the user makes a reservation, leaving the destination in the background.


The data provided for training and validation referred to users who created their accounts before July 1, 2014. And the test data on the leaderboard was three months after this date.

Several participants used cross-validation, randomly shuffling the examples, but because of the time to run the model during development, and also because the data characteristics depended on time, I decided to use May and June 2014 data as validation.

Some characteristics of the data, such as the proportion of users who had session data, changed over time, so I decided to use a period quite close to the test to validate. And this period proved itself quite robust.


User Profile

I used the standard user profile variables already described above. Since I planned to use a model based on decision trees, I encoded each categorical variable as an ordinal, as these models are able to find patterns even though the variables do not have a real order.

In addition, I tried to compute the amount of time a user spent between sessions, and how many different types of actions they performed on the site.


I extracted basic date information, creating individual columns for the day, month, year, and day of the week. This helps the model capture seasonal effects of the time series.
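That extraction can be sketched like this (the column name is hypothetical):

```python
import pandas as pd

# Toy account-creation dates.
df = pd.DataFrame({"account_created": pd.to_datetime(["2014-01-05", "2014-06-30"])})

# Individual calendar columns help tree models capture seasonality.
dt = df["account_created"].dt
df["day"] = dt.day
df["month"] = dt.month
df["year"] = dt.year
df["weekday"] = dt.dayofweek
```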

User Behavior

This is an essential part of most e-commerce models. In this specific case, I created a column with the amount of time the user spent on each action performed on the site. I did the same procedure to calculate how many times the user executed each particular action.
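Both aggregations can be sketched with pivot tables (toy session log; column names are mine):

```python
import pandas as pd

# Toy session log: one row per user action, with seconds elapsed.
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u2"],
    "action": ["click", "search", "click", "click", "message"],
    "secs_elapsed": [30, 120, 10, 45, 200],
})

# Total time per action, one row per user.
time_per_action = sessions.pivot_table(index="user_id", columns="action",
                                       values="secs_elapsed", aggfunc="sum",
                                       fill_value=0)
# How many times each user executed each action.
action_counts = sessions.pivot_table(index="user_id", columns="action",
                                     values="secs_elapsed", aggfunc="count",
                                     fill_value=0)
```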

Machine Learning Model

I used a Random Forest with 300 trees that was good enough to win a place in the Top 7% (91 of 1463).

In total, I had about 1100 columns, and most of them quite sparse (most of the values were zero). Some people say that models based on decision trees do not do well with this data format, but my experience suggests that it is a matter of finding the right parameters.

Unfortunately, I did not have time to train a good Gradient Boosted Trees model, which would certainly have performed better; given the small difference in scores between the Random Forest and the top of the leaderboard, it would almost certainly have been enough to finish in the Top 10.

Classifying Customer Visits to Walmart in 37 Categories Using Machine Learning

Defining a customer’s intent when visiting a store, whether real or virtual, can help a company deliver a personalized experience. That’s why it’s important to use tools that can help categorize and identify these visits.

Walmart has provided anonymous trip data for its customers to some of its stores. Among the data were the items purchased and their respective quantities, as well as descriptive information about the products and the day of the week. The task was to use this information to categorize the intent of that client’s visit.

Although the organizers did not give us a reference table for each code, they said that some examples of visit categories are: weekly food shopping, gifts for a celebration, or buying clothes for the new season.


There were about 95,000 trips for training, and 95,000 for testing. The file contained basically one line for each product purchased during a visit, and it was the competitor’s responsibility to properly transform it to merge the data for each trip.

The following information defined the products: the UPC code, with more than 100 thousand different types; the Fineline code, which is a refined category created by Walmart, with about 5,000 types; and the Department, which contained about 60 different types.

In addition, the ScanCount variable defined how many products of that type were purchased, and if it was a negative number, the customer was returning products.

Transforming the Variables

In order to train a model it was necessary, as a minimum, to group the basic information by trip. Several reports on the forum talked about transformations resulting in thousands of variables.

As I only had 7 days to work in this competition, my goal was to create a simple and compact model, with the fewest possible variables, but with a good performance to position me among the top 10%.

Basic Variables

Examples of variables based on statistics of each trip are: the average quantity of each product purchased during the trip, an indication of whether any product was returned, the total quantity of all products purchased, and the department with the highest quantity of items purchased.

Counts and Proportions

I decided to add the amount of products purchased by department, and use both the count of each department and the proportion that this department occupied in this purchase. This created about 120 variables.

Singular Value Decomposition + TF-IDF

To try to use Fineline and UPC without increasing the number of variables, I decided to aggregate the quantity of products in each of them and apply two transformations.

First, TF-IDF, which replaces each quantity with a weight based on the proportion of that item within the trip and on how often the item appears in other trips.

Then I applied the SVD, which tries to find the components that have the most variation in the data.

These transformations are usually used with text, applied to the word count in each document, and known as Latent Semantic Analysis in that context.

In addition to reducing the size of the data, it is expected to remove much of the noise, and find structural categories to which the trip belongs.

In practice this helped a lot with Fineline, but it did not help much with UPC.
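The TF-IDF + SVD combination can be sketched on a toy trip-by-Fineline count matrix (the component count is illustrative):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline

# Toy matrix: rows are trips, columns are Fineline codes, values are counts.
rng = np.random.default_rng(8)
counts = rng.poisson(0.3, size=(100, 50))

# TF-IDF reweights the counts; TruncatedSVD compresses them into a few
# dense components (the Latent Semantic Analysis recipe from text mining).
lsa = make_pipeline(TfidfTransformer(), TruncatedSVD(n_components=10, random_state=0))
components = lsa.fit_transform(counts)
```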

Logistic Regression L1 and L2

Another way to reduce size is to train a simpler model and use its predictions as a variable in the main model.

For this, I trained two logistic regressions: one with L2 penalty and another with L1 penalty. This generated about 70 variables, 37 for each regression, with the probability of an example belonging to each class.
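Generating those probability features can be sketched like this (toy data; in practice you would use out-of-fold predictions so the main model doesn’t see leaked training labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy features and 37 trip-type classes (every class present).
rng = np.random.default_rng(9)
X = rng.normal(size=(400, 20))
y = np.arange(400) % 37

# L2- and L1-penalized logistic regressions; their class probabilities
# become 37 + 37 extra columns for the main model.
lr_l2 = LogisticRegression(penalty="l2", max_iter=1000).fit(X, y)
lr_l1 = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
meta_features = np.hstack([lr_l2.predict_proba(X), lr_l1.predict_proba(X)])
```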


Gradient Boosted Trees – XGBoost

Most of my time was focused on building a good model with XGBoost. This model was already good enough to stay in the top 10%.

Neural Networks

To complement XGBoost, and try a better position, I decided to train a neural network on the same variables. It had 2 hidden layers and dropout.

Other competitors reported good results with neural networks, but I did not spend much time tinkering with them. The intention was only to achieve a slight improvement over the result of XGBoost.

Possible Results and Improvements

The final solution was a simple ensemble between a neural network and XGBoost, which was enough to secure a position among the top 7%. The solution took about 8 hours of work.

Using only the Gradient Boosted Trees model, without the ensemble, it was possible to stay in the Top 9%. That model had only 200 variables, while most of the better models had more than 5000. Increasing the number of variables could likely have pushed this model above the Top 5%, but it would have taken much longer to train.

Other possible improvements would be: to increase the ensemble, to create a set of variables optimized for the neural network, to use directly the variables with Fineline counts, to better tune the parameters of the logistic regressions and the transformations.

How We Won the Caterpillar Machine Learning Competition at Kaggle

Caterpillar is a manufacturer of industrial equipment such as tractors and engines. To maintain their operations, they need to buy tubes of various specifications, from several different suppliers, to use in their production line. Each supplier and product has a different pricing model.

The task in this competition was to create a model that would be able to price the tubes using historical supplier data and product characteristics.

I had the pleasure of competing on the team that won this competition, surpassing more than 1300 teams of data scientists from around the world. In this article I want to describe the most important parts about creating the solution that secured this victory.

Since I am most familiar with my own part of the solution, a good part of this article refers to it, but I want to make it clear that the winning solution was created from different models developed individually by each member. None of us would have won alone, and everyone was essential for the team to reach first place.


Several files with characteristics of the tubes and the suppliers were available. There are two basic categories of pricing: fixed price and variable price according to quantity.

There were about 30,000 rows for training and 30,000 for testing. Something important to consider is that several rows referred to the same tube, differing only in the minimum quantity that had to be purchased to obtain that price.

Among the available variables were: quantity, tube ID, date of purchase, annual use forecast, diameter and length.


This is one of the most important parts of any data mining task. It is essential to create a validation environment that has the same or very close characteristics of the production environment. In our case, the test data.

This means that it would not be enough to simply shuffle the lines and distribute them in folds as in “standard” cross-validation. In this case, I decided to create folds based on the tube IDs. In cross-validation, instead of distributing rows randomly, each fold contained all rows of a given tube.

Although it was a time series, training and test data were not split between past and future, so taking the date into account to define the validation data did not bring benefits.

This validation scheme proved to be quite robust and close to both the public leaderboard and the evaluation in the test data (private leaderboard).
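A sketch of this fold scheme using scikit-learn's GroupKFold, which keeps all rows of a group (here, a tube) in the same fold; the data is synthetic and the original implementation may have differed:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Synthetic data: several rows per tube, one per quantity bracket.
rng = np.random.default_rng(0)
tube_ids = np.repeat(np.arange(10), 3)   # 10 tubes, 3 rows each
X = rng.random((30, 4))
y = rng.random(30)

# Each fold holds out entire tubes, never individual rows.
gkf = GroupKFold(n_splits=5)
splits = list(gkf.split(X, y, groups=tube_ids))
leak_free = all(set(tube_ids[tr]).isdisjoint(tube_ids[va])
                for tr, va in splits)
print(leak_free)  # True: no tube appears on both sides of a split
```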

Feature Engineering

After a robust validation environment, the most important part is finding data characteristics that can be transformed into variables to feed the model.

I will present transformations that, in general, can be used with other data, so they are not specific to this task.

Ordinal Representation of Categorical Variables

The most popular Python implementations of decision-tree-based models treat all variables as numeric. There are alternatives, such as one-hot encoding, so that the model can find characteristics unique to each level of a categorical variable.

But because of how decision trees split, encoding the variables ordinally allows them to approximate the value of each category as if it were encoded in the "correct" way.

In this case we have a categorical column in which each level is represented by a number.
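A minimal example of ordinal encoding with pandas; the supplier codes are made up:

```python
import pandas as pd

# Made-up categorical column of supplier codes.
df = pd.DataFrame({"supplier": ["S-54", "S-26", "S-54", "S-41", "S-26"]})

# Ordinal encoding: each level becomes an integer the trees can split on.
df["supplier_ord"] = pd.factorize(df["supplier"])[0]
print(df["supplier_ord"].tolist())  # [0, 1, 0, 2, 1]
```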

Record Counts

The number of times a record appears in the data can also have predictive value.

A common, and quite useful, transformation is to replace each categorical value with the count of records that belong to that level. In this case it significantly improved the error.
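A sketch of count encoding with pandas; values are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"supplier": ["S-54", "S-26", "S-54", "S-41", "S-26", "S-54"]})

# Replace each level with the number of rows sharing it.
df["supplier_count"] = df["supplier"].map(df["supplier"].value_counts())
print(df["supplier_count"].tolist())  # [3, 2, 3, 1, 2, 3]
```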

Minimum and Maximum Quantity Available

Utilizing the fact that about 90% of the tubes had their price varying with the purchased quantity, I realized that the more expensive tubes had a lower maximum purchase quantity than cheaper tubes.

Most of the more expensive tubes reached their lowest price when buying 5 units, while the cheaper tubes went up to 250 units.

This attribute was very important, and demonstrates the value that exists in trying to understand the data and the task in question.

Information Leaks in IDs

This is the situation where the training data contains information that allows us to discover patterns in the dependent variable, but which would not exist in reality.

In this case, each tube was identified by a number, and when sorting the data using this variable, I realized that there were small tube clusters with similar prices and IDs. This means that the ordering of the tubes had a predictive value.

To use this information, I simply created a variable with the numeric part of the IDs.

Both in competitions and in practice, it is important to look for these patterns. In the first case, this can help you achieve a better result. In the second case, all information leakage must be removed so as not to compromise the performance of the model in a production environment.

Weight of Components

Each tube had different components that would be attached to it. The characteristics of each component varied, but one of them stood out, the weight.

With access to the quantity and type of component used in each tube, I calculated the sum of the weight of the components of each tube, and this was a variable with a significant correlation with the price, and small correlation with other features.

Machine Learning Models

Gradient Boosted Trees

Famous for its strong performance in competitions across many areas, this was my best model. I used the XGBoost implementation, which is one of the best open-source versions of this algorithm that I know.

XGBoost can optimize RMSE, but since this competition was scored on RMSLE, i.e. the logarithmic error, I transformed the dependent variable with log(y + 1). There were suggestions to use a root of the variable instead of the log, and my best model ended up being trained on the target transformed with the 15th root.
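Both target transformations can be sketched in a few lines of numpy; predictions are mapped back with the inverse transform (prices are made up):

```python
import numpy as np

y = np.array([100.0, 5.0, 0.5])  # made-up tube prices

# RMSE on log(y + 1) corresponds to RMSLE on the original prices.
y_log = np.log1p(y)
y_back = np.expm1(y_log)   # invert after predicting

# The alternative mentioned: a 15th-root transform of the target.
y_root = y ** (1 / 15)
y_root_back = y_root ** 15

print(np.allclose(y_back, y), np.allclose(y_root_back, y))  # True True
```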

This model, alone, was enough to get the 10th place.

Regularized Greedy Forests

This was the novelty of this competition. Although I tested this model in the Facebook competition, I had never dedicated time to make it work.

The algorithm basically optimizes the same loss function as Gradient Boosted Trees, but with an explicit regularization term. In addition, it periodically readjusts the weights of every terminal node already in the ensemble.

In theory, it’s an improvement over Gradient Boosted Trees, but in this competition it did not present a better performance. I believe that with more time to adjust the parameters and train it is possible to achieve a better result.

Other Models

The team trained two other types of models to compose the ensemble, but they alone did not perform well:

– Neural networks
– Factorization Machines

Ensemble (Stacking)

The ensemble was done in three levels, using models of all the members of the team.

At the first level, the models received the original data and made their individual predictions on data outside the training sample in the same cross-validation scheme.

At the second level, four models used predictions created at the previous level as training data, and again made predictions using a cross-validation scheme to avoid overfitting.

At the third level, the mean of the predictions of the second level models was used as the final forecast.
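A rough sketch of such a three-level stack with scikit-learn; the models, fold counts, and data here are illustrative stand-ins, not the team's actual models:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=200, n_features=10, random_state=0)

# Level 1: out-of-fold predictions from each base model.
level1_models = [RandomForestRegressor(n_estimators=50, random_state=0),
                 Ridge()]
level1_preds = np.column_stack(
    [cross_val_predict(m, X, y, cv=5) for m in level1_models])

# Level 2: models trained on the level-1 predictions, again out-of-fold
# to avoid overfitting.
level2_models = [Ridge(alpha=1.0), Ridge(alpha=10.0)]
level2_preds = np.column_stack(
    [cross_val_predict(m, level1_preds, y, cv=5) for m in level2_models])

# Level 3: the mean of the level-2 predictions is the final forecast.
final = level2_preds.mean(axis=1)
print(final.shape)  # (200,)
```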


In this competition, it was extremely unlikely to win without forming a good team. The use of solutions created by different members, which eventually improved the predictions in the ensemble, was essential.

In addition one of the most important factors for winning a Kaggle competition is time. The more time you have to test ideas, the greater the chances of success.

Determining Product Relevance in a Search Engine With Machine Learning

One of the most important tasks for an e-commerce site is to ensure that users find the products they are looking for. This includes a search engine that returns relevant products. In this competition, CrowdFlower, a specialized platform for data collection, made product search data available, and the median relevance score given by three people for each pair of product and search term.

The goal was to create a model that could calculate the relevance of a product to a given search phrase.


Approximately 10,000 training records were available, and 20,000 were separated for testing. In these records there was the search term, title and description of the product, the median of relevance and the variance of the relevance attributed by the users.

In the test samples there was a portion of "fake" records that were not used to compute performance; they were included to prevent competitors from labeling the test set by hand.

Evaluation Metric

The evaluation metric was the Quadratic Weighted Kappa. It measures the agreement between two "raters" beyond what would be expected by chance; in this case, between the score given by the model and the score given by the humans. It is sensitive not only to accuracy but also to the distribution of the scores (labels). A random solution yields a Kappa of 0, total agreement a Kappa of 1, and Kappa can be negative when agreement is worse than chance.

Data Processing


The basic transformation of title and description is to make a bag of words. In its simplest form it is a matrix whose rows are the examples (documents) and the columns are words (terms). The value of each cell is the number of times the word appears in that document.

A more advanced transformation is to take this matrix and do some operations using the frequency of terms in each document, and also in the document body. This is called TF-IDF. In this case, the most frequent terms within a specific document receive a greater weight, but that is discounted by a factor inversely proportional to its frequency in other documents.

Put simply: Rare words that appear a lot in a document are given more weight, and frequent words that appear in all documents become less important.

Beyond that, it is important to set a minimum number of times a word must appear across the corpus before it becomes a column, in order to reduce noise.

The best representation I found was to build one matrix based only on the titles, and another based only on the descriptions. Some participants suggested training on both at the same time, but since this is a search problem, I believe it is important to keep them separate.
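A sketch of this dual representation with scikit-learn; the strings are toy examples, and in the competition min_df would be set higher to filter rare terms:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

titles = ["red running shoes", "wireless mouse", "running socks"]
descriptions = ["lightweight shoes for road running",
                "ergonomic wireless mouse with usb receiver",
                "breathable socks for runners"]

# One vocabulary per field keeps title terms and description terms distinct.
title_vec = TfidfVectorizer(min_df=1)
desc_vec = TfidfVectorizer(min_df=1)

X_title = title_vec.fit_transform(titles)
X_desc = desc_vec.fit_transform(descriptions)
X = hstack([X_title, X_desc])   # side-by-side sparse feature matrix
print(X.shape[0])  # 3 rows, one per product
```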

SVD – Latent Semantic Analysis

As you might expect, the matrix of documents is huge, and it’s pretty sparse. This makes it difficult to use with some non-linear models. In this case there is the option of reducing dimensionality using SVD. This linear algebra algorithm basically finds the main components of the document matrix, where there is greater variation. This reduces the dimensionality of the problem, but can cause loss of information.

In Natural Language Processing this tends to work well, because it filters out some of the noise in the data. Applying SVD to the TF-IDF matrix gives a representation known as Latent Semantic Analysis. The interpretation is that each component obtained refers to a "concept" represented by the documents. It is also widely used to visualize documents in a low-dimensional space.

In the competition I used SVD to reduce the dimensionality of the TF-IDF matrix before feeding it to the models.

Other Attributes

In addition to the above features, I created others that seemed to make sense, and some that I found in related academic papers. The most important of these was the percentage of search words that were in the title. In addition, I created attributes based on the average similarity between a product and the others of the same search term, as well as the count of times the search term appears in the data.

An important part was coding each search term with one hot. That is, for each of them I created a column that had the number 1 if it was present, and the number 0 otherwise. I did this to help the model capture the predictive value inside each search term.
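The most important feature mentioned above, the fraction of search words found in the title, can be sketched as follows; this uses simple whitespace tokenization, while the original may have applied stemming or other preprocessing:

```python
def pct_query_in_title(query: str, title: str) -> float:
    """Fraction of search-term words that also appear in the title."""
    words = query.lower().split()
    title_words = set(title.lower().split())
    return sum(w in title_words for w in words) / len(words)

full = pct_query_in_title("running shoes", "Red Running Shoes")
half = pct_query_in_title("wireless mouse", "USB Mouse")
print(full, half)  # 1.0 0.5
```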

Regression or Classification?

There were two ways to frame this task: treat each level of the median as a class, making it a classification problem, or treat it as a regression problem and round the predictions.

At first I thought it was more appropriate to treat it as a regression, but since Kaggle wanted the rounded predictions, I ended up creating several classification models and did not have time to test the other alternative.

With the classification we have two options: use the class provided by the model, or use the probabilities of each class to make a weighted average to get the score. The first alternative proved to be better in this case.


Support Vector Machine

The best models were obtained using SVM. It is one of the most used models for text classification, so it is no surprise that it performs well.

The best individual model, using the SVD components and the other attributes, reached a Kappa of 0.672 in the cross-validation, and 0.654 in the LB.

Random Forest

One of my favorite models, but not usually the best choice for sparse data in high dimensions. Anyway, after several tests, I got 0.6695 in cross-validation, and 0.6508 in LB using a TF-IDF matrix with the other attributes.

One reason it may behave better with the TF-IDF matrix than with the SVD matrix is that Random Forest is a variance-reducing method, so it remains robust even with many attributes, as long as they are relevant to the prediction.

Gradient Boosted Decision Trees

Although this model is also not very common for use with sparse data in high dimensions, I decided to train several “weak” models and stack using their predictions and the other attributes, which did not involve a representation of the words.

The stacked models had an average of 0.60 Kappa in validation.

The score for this model was around 0.624 on LB. But the real intention with it was to have a different model to ensemble.

Post Processing

In all of the Top 10 solutions I read, there was one detail in common: post-processing predictions. Teams used special methods for rounding, multiplying by coefficients, or determining different cut-off points for predicting each “class”.

In one case, the team ordered the predictions and, according to the proportion of each class in the training data, rounded the forecasts. Kappa is a metric sensitive to the distribution of predictions, so this ended up helping.

The closest I got to adjusting the predictions was to use the geometric mean instead of the simple average for the ensemble. The models tended to overestimate the relevance of the products, and the geometric mean helped bias the predictions toward the lower classes.
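The effect can be seen in a small numpy example: by the AM-GM inequality, the geometric mean of positive predictions is never above the arithmetic mean (the numbers here are made up):

```python
import numpy as np

# Made-up relevance scores (1-4) from three models for two products.
preds = np.array([[3.2, 3.6, 3.9],
                  [1.4, 2.0, 1.7]])

arith = preds.mean(axis=1)
geo = np.exp(np.log(preds).mean(axis=1))

# The geometric mean never exceeds the arithmetic mean,
# pulling over-optimistic predictions toward the lower classes.
print(np.all(geo <= arith))  # True
```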

Ensemble and Overfitting

In the end, I took a geometric mean of my best models. The ensemble consisted of several SVMs, a Random Forest, and a Gradient Boosted Decision Trees model.

Although it is not good practice, especially with as little data as this competition had, I used the LB to validate the ensemble. That led me to pick my second-best ensemble, yet I still finished in the Top 5% of the competition.

How I Got 6th Place Among 985 Participants in the Facebook Machine Learning Competition

In this recruiting competition, Facebook provided anonymous data about users of an auction site. The task was to create a model that determined which user is human and which robot. The human users of an auction site do not want to have to compete with machines. A site with many robots can experience an exodus of human users, so it is important to have a robust system for detecting suspicious users.

According to the administrators, this data did not come from Facebook, but I think the interest was to create an environment similar to their advertising auction system.

This competition has become quite special for me, since it was my first Top 10 on Kaggle, and I ended up becoming Kaggle Master. This was my first goal, but it also has the added bonus of being considered for an interview for the Data Scientist position at Facebook.


One of the most interesting parts of this competition, which made it very close to a real data science case, was that we did not get a ready-made file of variables related to the target that could be fed directly into a machine learning model.

Two files were available: the first contained about 2000 rows with anonymous information from users, such as ID and Address. The second contained more than 7 million bids made on the site.

Variables in the second file included: Bid ID, User ID, Auction ID, auction category, user device, time, country, IP, and URL. The time was coded, so it was not possible to find out the exact date of the bids.


Having a robust validation environment is essential for any machine learning task. In this case there was the challenge of having only 2000 examples to train and validate. After testing several combinations, I decided to use a stratified cross-validation, with 5 folds, and repeat this validation 5 times.

Repeating the validation places the examples in different random folds each time, which helps stabilize the error estimate, since any single example has less influence on the mean.
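A sketch of this scheme with scikit-learn's RepeatedStratifiedKFold; the class sizes below are illustrative:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Roughly 2000 examples with a strong class imbalance.
y = np.array([0] * 1900 + [1] * 100)
X = np.zeros((2000, 1))

# 5 stratified folds, repeated 5 times with different shuffles.
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
n_splits = sum(1 for _ in rskf.split(X, y))
print(n_splits)  # 25 train/validation splits to average the AUC over
```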

Feature Engineering

The most important part of any data science job is to ensure that you have the correct data to do the prediction. A very common mistake is to think that you just put the data in the algorithm, and it will learn what you want. This was by far the most important part of my solution.


The first step was to create variables counting how many bids each user made. Extending this idea, I created variables counting how many different devices, IPs, categories, countries, and URLs the same user had used. With these simple variables, an untuned Random Forest reached 0.86 AUC in 5-fold cross-validation.

In the case of countries, besides feeding the count of distinct countries into the model, I made individual variables for each country, counting how many bids the user made from it. This helps the model identify "risky countries", where it may be easier to host bots that bid.
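These count features are straightforward to compute with a pandas groupby; the rows below are a made-up slice standing in for the bids file:

```python
import pandas as pd

# Made-up slice standing in for the 7-million-row bids file.
bids = pd.DataFrame({
    "bidder_id": ["a", "a", "a", "b", "b"],
    "ip":        ["1.1", "1.2", "1.1", "9.9", "9.9"],
    "country":   ["us", "us", "br", "de", "de"],
})

# One row per user: total bids plus counts of distinct IPs and countries.
feats = bids.groupby("bidder_id").agg(
    n_bids=("ip", "size"),
    n_ips=("ip", "nunique"),
    n_countries=("country", "nunique"),
)
print(feats.loc["a"].tolist())  # [3, 2, 2]
```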

Simple Statistics

After seeing that the counts had great predictive value, I decided to test simple statistics, such as averages and deviations. Examples of important variables:

– Average and standard deviation of bids per auction;
– Maximum number of bids given in the same auction, with the same timestamp;
– Average time between bids.


The variable that indicated the time in the original data was coded so that you could not know the exact date that the bids occurred. But after analyzing the data, I realized that even though I did not know the exact date, it was possible to identify what would be days, hours and minutes.

This allowed me to create variables related to averages, deviations, proportions and counts based on units of time. Some important variables:

– Standard deviation of bids per day;
– Average bids per day;
– Average bids on the same timestamp;
– Time-of-day bid count;
– Average auctions that the user participated in the same day.

It seems that all the participants who finished at the top of the LB were able to identify these time patterns.


Some proportions based on counts and time were also important. Some examples are:

– Maximum proportion of total bids in a day assigned to the user;
– Density of bids per hour of day


When you have variables that contain enough information about the event, the selection and adjustment part of the model’s parameters becomes simpler. In this case, despite briefly testing linear models, I ended up choosing to better explore models based on decision tree ensembles.

Random Forest

The first model I tested, and the one that proved itself most useful in this case.

I believe its superior performance came from being a model geared toward variance reduction. Since we had little data to train on, it was important that the model could cope with extreme values and produce stable predictions.

A Random Forest with 500 trees and parameters adjusted by cross-validation had 0.9112 AUC in the local measurement, and 0.9203 in the LB.

Gradient Boosted Trees

This is one of the most used models in data science competitions because it offers a good performance in several types of tasks. In this case it was not so. My best models were around 0.90-0.91 AUC on LB despite reaching 0.94 on validation.

This raises a question: we have a decision-tree-based model that performs well (Random Forest), and in theory boosting should improve the score, so why was that not the case here?

My answer: Random Forest accepts a bit more bias in exchange for lower variance, making the model more rigid, while GBT reduces bias at the cost of higher variance, making it more sensitive to noise in the data. Here we had fewer than 2000 examples and a strong class imbalance. Together these produce estimates with a lot of variance, and increasing the variance further causes overfitting.

The solution would be tuning and regularizing the GBT, but with little data to validate it is difficult to trust the results. So I decided to rely on theoretical reasons and continue working with Random Forest to have a more stable model.


The final touch of most of the best competition solutions is to make an ensemble with the best models.

Random Ensemble

Due to the small number of examples, it was hard to trust the parameters found through validation. So I decided to build an ensemble of Random Forests with randomized parameters; each one used a different seed and slightly different settings. This ensemble achieved an AUC of 0.92 both in the out-of-bag estimate and on the LB.

AdaBoost Random Forest

Despite the instability of the GBT, I decided to apply boosting to Random Forest. Perhaps applying this method to more stable models than decision trees could help, even with the small number of examples. And that’s exactly what happened.

In the validation it obtained 0.925 of AUC, and 0.928 in LB.

There was still a difference between the validation score and the LB when I added new variables.

Stabilizing the Seeds

To stabilize the predictions, I ran the boosted Random Forest with several different seeds and averaged the results. This did not improve the LB score, but it was stable in cross-validation, which made me trust the model.


In the end, I used a Boosted Random Forest model with a set of attributes that presented AUC 0.94 in cross-validation, and it was basically my score in the final test set of the competition, which gave me the 6th place.