Caterpillar is a manufacturer of industrial equipment such as tractors and engines. To maintain its operations, it needs to buy tubes of various specifications from several different suppliers for use in its production line. Each supplier and product has a different pricing model.
The task in this competition was to create a model that would be able to price the tubes using historical supplier data and product characteristics.
I had the pleasure of competing on the team that won this competition, finishing ahead of more than 1,300 teams of data scientists from around the world. In this article I describe the most important parts of building the solution that secured this victory.
Because I am most familiar with my own part of the solution, much of this article focuses on it. I want to make it clear, however, that the winning solution combined different models developed individually by each member. None of us would have won alone, and everyone was essential for the team to reach first place.
Several files with characteristics of the tubes and the suppliers were available. There are two basic categories of pricing: fixed price and variable price according to quantity.
There were about 30,000 rows for training and 30,000 for testing. An important detail is that several rows referred to the same tube, differing only in the minimum purchase quantity required to get that price.
Among the available variables were: quantity, tube ID, date of purchase, annual use forecast, diameter and length.
Validation is one of the most important parts of any data mining task. It is essential to create a validation environment whose characteristics match, or come very close to, those of the production environment, which in our case was the test data.
This means that it would not be enough to simply shuffle the rows and distribute them across folds as in “standard” cross-validation. Instead, I created folds based on the tube IDs: rather than distributing rows randomly, each fold contained all the rows of a given tube, so the rows of one tube never appeared on both the training and validation sides.
Although the data formed a time series, the training and test data were not split between past and future, so using the date to define the validation data brought no benefit.
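The grouped folds described above can be sketched with the standard library alone. This is only an illustration: the function name `group_kfold` and the sample tube IDs are hypothetical, and scikit-learn's `GroupKFold` provides comparable behavior.

```python
def group_kfold(groups, n_folds):
    """Yield (train_idx, valid_idx) pairs in which all rows of a group share one fold."""
    # Assign each distinct group to a fold, round-robin, in order of first appearance.
    fold_of_group = {}
    for g in groups:
        if g not in fold_of_group:
            fold_of_group[g] = len(fold_of_group) % n_folds
    for fold in range(n_folds):
        valid_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, valid_idx

# Hypothetical rows: several quantity brackets per tube.
tube_ids = ["TA-00001", "TA-00001", "TA-00002", "TA-00003", "TA-00003", "TA-00004"]
folds = list(group_kfold(tube_ids, n_folds=2))

for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

Because fold membership is decided per tube rather than per row, a model validated this way is always scored on tubes it has never seen.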
This validation scheme proved to be quite robust and close to both the public leaderboard and the evaluation in the test data (private leaderboard).
After a robust validation environment, the most important part is finding data characteristics that can be transformed into variables to feed the model.
I will present transformations that, in general, can be used with other data, so they are not specific to this task.
Ordinal Representation of Categorical Variables
The most popular implementations of decision-tree-based models available in Python treat all variables as numeric. There are alternatives for handling categorical variables in this setting, such as one-hot encoding, so that the model can find unique characteristics of each level.
But by the very nature of the model, encoding the variables ordinally still allows decision trees to approximate the value of each category as if it were encoded in the correct way, since a sequence of threshold splits can isolate any individual level.
With this encoding, we have a single categorical column in which each level is represented by an integer.
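As a minimal sketch of ordinal encoding, assuming integer codes are assigned in order of first appearance (the helper name `ordinal_encode` and the supplier labels are hypothetical):

```python
def ordinal_encode(values):
    """Map each distinct category to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
        encoded.append(codes[v])
    return encoded, codes

# Hypothetical supplier labels.
suppliers = ["S-0066", "S-0041", "S-0066", "S-0072"]
encoded, mapping = ordinal_encode(suppliers)
print(encoded)  # [0, 1, 0, 2]
```

The specific integers carry no meaning of their own; the tree only needs each level to map consistently to the same value in training and test data.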
The number of times a record appears in the data can also have predictive value.
A common, and quite useful, transformation is to replace each categorical value with the number of records that belong to that level. In this case it brought a significant improvement in the error.
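A count encoding along these lines could look like the sketch below. The helper name and sample labels are hypothetical, and whether the counts should come from the training rows alone or from training and test rows together is a choice not covered here.

```python
from collections import Counter

def count_encode(values):
    """Replace each category with the number of rows that share it."""
    counts = Counter(values)
    return [counts[v] for v in values]

# Hypothetical supplier labels.
suppliers = ["S-0066", "S-0041", "S-0066", "S-0072"]
print(count_encode(suppliers))  # [2, 1, 2, 1]
```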
Minimum and Maximum Quantity Available
Given that about 90% of the tubes had a price that varied with the purchased quantity, I noticed that the more expensive tubes had a lower maximum purchase quantity than the cheaper ones.
Most of the more expensive tubes reached their lowest price at a purchase of 5 units, while the cheaper tubes went up to 250 units.
This attribute was very important, and demonstrates the value that exists in trying to understand the data and the task in question.
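One way to build this attribute is to attach each tube's smallest and largest listed quantity to all of its rows. The sketch below assumes rows are dictionaries with hypothetical `tube_id` and `quantity` keys; the sample values are made up for illustration.

```python
def add_quantity_range(rows):
    """Attach each tube's minimum and maximum listed quantity to all of its rows."""
    ranges = {}
    for row in rows:
        q = row["quantity"]
        lo, hi = ranges.get(row["tube_id"], (q, q))
        ranges[row["tube_id"]] = (min(lo, q), max(hi, q))
    return [
        {
            **row,
            "min_quantity": ranges[row["tube_id"]][0],
            "max_quantity": ranges[row["tube_id"]][1],
        }
        for row in rows
    ]

# Hypothetical rows: one tube priced up to 5 units, another up to 250.
rows = [
    {"tube_id": "TA-00001", "quantity": 1},
    {"tube_id": "TA-00001", "quantity": 5},
    {"tube_id": "TA-00002", "quantity": 1},
    {"tube_id": "TA-00002", "quantity": 50},
    {"tube_id": "TA-00002", "quantity": 250},
]
features = add_quantity_range(rows)
print(features[0])
```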
Information Leaks in IDs
This is the situation where the training data contain information that lets us discover patterns in the dependent variable, but which should not exist in reality.
In this case, each tube was identified by a number, and when sorting the data using this variable, I realized that there were small tube clusters with similar prices and IDs. This means that the ordering of the tubes had a predictive value.
To use this information, I simply created a variable with the numeric part of the IDs.
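Extracting the numeric part can be as small as the sketch below, which assumes IDs made of a prefix followed by digits (the `TA-00152` format and the function name are illustrative assumptions):

```python
import re

def id_number(tube_id):
    """Return the first run of digits in an ID as an integer, or None if there is none."""
    match = re.search(r"\d+", tube_id)
    return int(match.group()) if match else None

print(id_number("TA-00152"))  # 152
```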
Both in competitions and in practice, it is important to look for these patterns. In the first case, this can help you achieve a better result. In the second case, all information leakage must be removed so as not to compromise the performance of the model in a production environment.
Weight of Components
Each tube had different components that would be attached to it. The characteristics of each component varied, but one of them stood out: the weight.
With access to the quantity and type of each component used in a tube, I calculated the total weight of the components of each tube. This variable had a significant correlation with the price and only a small correlation with the other features.
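The calculation amounts to a weighted sum over a tube's bill of materials. In this sketch the component IDs, weights, and function name are hypothetical, and treating a component with unknown weight as zero is an assumption made only for illustration.

```python
def total_component_weight(bill_of_materials, component_weights):
    """Sum quantity * unit weight over the (component_id, quantity) pairs of one tube."""
    # Components missing from the weight table contribute 0.0 (an assumption).
    return sum(
        quantity * component_weights.get(component_id, 0.0)
        for component_id, quantity in bill_of_materials
    )

# Hypothetical unit weights and one tube's components.
component_weights = {"C-1312": 0.5, "C-1631": 0.25}
bill_of_materials = [("C-1312", 2), ("C-1631", 1)]
print(total_component_weight(bill_of_materials, component_weights))  # 1.25
```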
Machine Learning Models
Gradient Boosted Trees
Famous for its good performance in competitions across different areas, this was my best model. I used the XGBoost implementation, which is one of the best open source implementations of this model that I know.
XGBoost has the option to optimize the RMSE, but since the competition metric was the RMSLE, i.e. the root mean squared logarithmic error, I transformed the dependent variable with the log(y + 1) function. There were suggestions to use a root of the variable instead of the log, and my best model ended up being trained on the variable transformed with the 15th root.
This model alone was enough to reach 10th place.
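The sketch below illustrates why the log transformation fits this metric: the RMSE computed on log(y + 1) values is exactly the RMSLE on the original scale. It also shows the 15th-root transformation and its inverse. The prices and predictions are made-up numbers and the helper names are hypothetical. Note that with the root transformation the RMSE on the transformed scale no longer equals the RMSLE, so that choice is an empirical one.

```python
import math

def rmse(a, b):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, written out explicitly."""
    total = sum((math.log(p + 1) - math.log(t + 1)) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(total / len(y_true))

# Made-up prices and predictions, for illustration only.
prices = [3.5, 21.9, 8.0]
preds = [4.0, 20.0, 7.5]

# RMSE on log(y + 1) values equals RMSLE on the original values.
log_prices = [math.log1p(y) for y in prices]
log_preds = [math.log1p(p) for p in preds]
print(rmse(log_prices, log_preds), rmsle(prices, preds))

# Predictions made on the log scale are mapped back with expm1.
restored = [math.expm1(z) for z in log_prices]

# Root transformation and its inverse.
ROOT = 15

def to_root(y):
    return y ** (1.0 / ROOT)

def from_root(z):
    return z ** ROOT
```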
Regularized Greedy Forests
This was the novelty of this competition. Although I had tested this model in the Facebook competition, I had never dedicated time to making it work.
Basically, the algorithm optimizes the same loss function as Gradient Boosted Trees, but with an added regularization term. In addition, it periodically readjusts the weights of every terminal node of the trees in the ensemble.
In theory, it is an improvement over Gradient Boosted Trees, but in this competition it did not perform better. I believe that with more time to tune the parameters and train, it could achieve a better result.
The team trained two other types of models to compose the ensemble, but on their own they did not perform well:
– Neural networks
– Factorization Machines
The ensemble was done in three levels, using models of all the members of the team.
At the first level, the models received the original data and made their individual predictions on data outside the training sample in the same cross-validation scheme.
At the second level, four models used predictions created at the previous level as training data, and again made predictions using a cross-validation scheme to avoid overfitting.
At the third level, the mean of the predictions of the second level models was used as the final forecast.
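The three levels above can be sketched with toy stand-ins for the real models. Everything in this block is illustrative: the data, the fold indices, and the two simple model types are assumptions, and only two second-level models are used instead of four to keep the sketch short. In practice the folds would be the tube-based ones from the validation scheme.

```python
from statistics import mean

def out_of_fold(fit, X, y, folds):
    """Return one prediction per row, each made by a model that never saw that row."""
    preds = [None] * len(X)
    for train_idx, valid_idx in folds:
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in valid_idx:
            preds[i] = model(X[i])
    return preds

def fit_mean(X, y):
    """Toy model: always predict the training mean."""
    m = mean(y)
    return lambda row: m

def fit_linear_on(col):
    """Toy model factory: least-squares line on a single column."""
    def fit(X, y):
        xs = [row[col] for row in X]
        mx, my = mean(xs), mean(y)
        var = sum((x - mx) ** 2 for x in xs)
        slope = sum((x - mx) * (t - my) for x, t in zip(xs, y)) / var if var else 0.0
        return lambda row: my + slope * (row[col] - mx)
    return fit

# Made-up data and fixed folds, for illustration only.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 6.0], [4.0, 2.0], [5.0, 7.0], [6.0, 1.0]]
y = [1.1, 2.0, 3.2, 3.9, 5.1, 6.0]
folds = [([2, 3, 4, 5], [0, 1]), ([0, 1, 4, 5], [2, 3]), ([0, 1, 2, 3], [4, 5])]

# Level 1: base models make out-of-fold predictions from the original features.
level1 = [out_of_fold(f, X, y, folds) for f in (fit_mean, fit_linear_on(0), fit_linear_on(1))]
meta_X = [list(row) for row in zip(*level1)]

# Level 2: models trained on the level-1 predictions, again out of fold.
level2 = [out_of_fold(f, meta_X, y, folds) for f in (fit_linear_on(1), fit_linear_on(2))]

# Level 3: the final forecast is the mean of the level-2 predictions.
final = [mean(p) for p in zip(*level2)]
print(final)
```

The key property is that every prediction passed up a level comes from a model that did not train on that row, which is what keeps the higher levels from overfitting to the lower ones.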
In this competition, winning was extremely unlikely without forming a good team. The use of solutions created by different members, which ultimately improved the predictions of the ensemble, was essential.
In addition, one of the most important factors for winning a Kaggle competition is time. The more time you have to test ideas, the greater the chances of success.