Walmart is a company with thousands of stores in 27 countries, and several articles describe the technology it uses to manage the logistics and distribution of its products. This was the second time it ran a contest on Kaggle with the intention of finding interview candidates for data scientist jobs.
A major advantage of this type of competition is that we get access to data from large companies and can see what problems they are trying to solve with probabilistic models.
The objective of the competition was to create a model that could predict the sales of certain products in specific stores in the days before and after blizzards and storms. The example given in the task description was the sale of umbrellas, which intuitively should increase before a big storm.
Two files were made available to train the model: one of them contained information about the identification of the stores, products, and the nearest weather stations. The other contained weather data for each station.
In total, data were available on 111 products whose sales could be affected by weather conditions, 45 stores, and 20 weather stations. The goal was to predict how many units of each product would be sold in each store in the 3 days before, the 3 days after, and on the day of the weather event.
A weather event meant that more than an inch of rain, or more than two inches of snow, was recorded that day.
The combination of stores and products gave us about 4.6 million examples for training, and about 500,000 for testing. Each example referred to a day, store and product.
The metric used was the Root Mean Squared Logarithmic Error (RMSLE). It is basically the RMSE applied to the log(x + 1) transformation of the predictions and of the true values. This means that errors in predictions that should be close to 0 are punished more severely than errors of the same size in predictions with higher values. For example, predicting 5 items when the true value is 0 carries a greater penalty than predicting 105 when it is 100.
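To make the 5 versus 0 and 105 versus 100 comparison concrete, here is a minimal sketch of the metric in NumPy. The function name `rmsle` is my own choice for this illustration, not something provided by the competition.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE computed on log(x + 1) of the true values and the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Same absolute error of 5 units, very different penalties.
err_near_zero = rmsle([0], [5])     # log(6) - log(1), about 1.79
err_large = rmsle([100], [105])     # log(106) - log(101), about 0.048
```

The first error is roughly 37 times larger than the second, even though both predictions miss by 5 units.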
Transforming the Data
Since I only had 10 days to work on this competition, I decided to see how far I could get with a model based only on the original variables and, who knows, make an ensemble.
One difference between this and other competitions is that, even though the data was organized, you were responsible for linking the weather variables to the store and product data. It makes perfect sense, because Walmart would not want a data scientist who does not know how to manipulate data.
For this I used Pandas. This Python library is one of the most widely used for data manipulation, and it strongly resembles the data structures available in R.
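The linking step can be sketched as two merges: one that attaches the nearest station to each store, and one that attaches that station's weather for the same day. This is only an illustration on toy data, not the original competition code. Apart from `item_nbr`, the column names (`date`, `store_nbr`, `station_nbr`, `units`, `tavg`, `preciptotal`) are assumptions.

```python
import pandas as pd

# Toy stand-ins for the competition files; column names other than
# item_nbr are assumptions made for this illustration.
sales = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-01", "2013-01-02"],
    "store_nbr": [1, 2, 1],
    "item_nbr": [9, 9, 9],
    "units": [3, 0, 5],
})
key = pd.DataFrame({"store_nbr": [1, 2], "station_nbr": [10, 11]})
weather = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-01", "2013-01-02", "2013-01-02"],
    "station_nbr": [10, 11, 10, 11],
    "tavg": [30.0, 41.0, 28.0, None],
    "preciptotal": [0.0, 1.2, 0.3, 0.0],
})

# Step 1: attach each store's nearest weather station.
train = sales.merge(key, on="store_nbr", how="left")
# Step 2: attach that station's weather for the same day.
train = train.merge(weather, on=["date", "station_nbr"], how="left")
```

Using `how="left"` keeps every sales row even when a station has no weather record for that day, so missing weather shows up as NaN instead of silently dropping examples.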
At first I treated all the variables as numerical and trained an XGBoost model with slight tuning, excluding the variables that coded special alerts and using 10% of the dataset to determine the number of trees. As expected, the result was poor: about 0.1643 on the leaderboard (LB).
After testing the first model, I coded the categorical variables with one-hot encoding. That is, for each level of the variable a 0/1 indicator column was created, marking whether that level was present in the example. Normally the number of columns should be the number of levels minus one, so that you do not have problems with collinearity. Since I planned to use models that were not sensitive to this problem, I did not bother deleting one of the columns.
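A minimal sketch of this encoding with `pandas.get_dummies`, keeping one column per level. The toy data and the choice of `station_nbr` as the categorical column are assumptions for illustration only.

```python
import pandas as pd

# Toy data; station_nbr is treated as a categorical variable here.
df = pd.DataFrame({
    "station_nbr": [5, 9, 5, 14],
    "tavg": [30.0, 41.0, 28.0, 35.0],
})

# One 0/1 column per level; drop_first=False keeps every level,
# which is fine for tree-based models.
encoded = pd.get_dummies(df, columns=["station_nbr"], drop_first=False, dtype=int)
```

For a linear model, `drop_first=True` would drop one level per variable and avoid the collinearity mentioned above.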
After tuning on a subset with 10% of the data and getting an RMSLE of 0.1216 in validation, I submitted the solution and got 0.1204 on the LB, a good improvement over the previous one.
A lot of weather data were missing, so I decided to test a simple imputation method: replace the NaNs with the mean of the column values. After tuning again, now with parameters suited to the imputed data, I obtained an RMSLE of 0.1140 in the 10% validation and 0.1095 on the LB.
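Mean imputation of this kind takes two lines in pandas. The sketch below uses made-up weather columns (`tavg`, `preciptotal`) as an assumption.

```python
import numpy as np
import pandas as pd

weather = pd.DataFrame({
    "tavg": [30.0, np.nan, 28.0, 35.0],
    "preciptotal": [0.0, 1.2, np.nan, 0.3],
})

# Column means are computed ignoring NaNs, then used to fill them.
col_means = weather.mean()
imputed = weather.fillna(col_means)
```

Here the missing `tavg` becomes 31.0 and the missing `preciptotal` becomes 0.5, the means of the observed values in each column.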
I did not explore the temporal dimension of the data very much. In one of the attempts I added the previous day’s meteorological information, which reduced the error to 0.1083.
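One way to add the previous day's weather is a shift within each station, sketched below on toy data. The column names are assumptions, and the sketch assumes each station has one row per consecutive day; with gaps in the dates, `shift(1)` would pick up the previous available row instead of the previous calendar day.

```python
import pandas as pd

weather = pd.DataFrame({
    "station_nbr": [10, 10, 10, 11, 11],
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03",
                            "2013-01-01", "2013-01-02"]),
    "preciptotal": [0.0, 1.2, 0.3, 0.5, 0.0],
})

# Sort so that shifting by one row within a station means "one day earlier".
weather = weather.sort_values(["station_nbr", "date"])
weather["preciptotal_prev"] = (
    weather.groupby("station_nbr")["preciptotal"].shift(1)
)
```

The first day of each station has no previous value and is left as NaN.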
One method that worked very well in my first competition, and that I always try, is to divide the training set into small subsets related to some variable and train a specific model for each one. In this case, I decided to create 19 models, one for each of the 19 weather stations present in the test set. The RMSLE on the LB was 0.1101, worse than with a single model that treats the stations as variables in the entire dataset.
A serious problem with this approach is using the same model, with the same parameters, for different datasets. Knowing this, I decided to do a small tuning of the parameters for each dataset, which reduced the LB RMSLE to 0.1069.
Despite the small difference, it seems the division into individual models for each station captured some information that was not present in the model trained on all stations together.
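The per-station setup described above can be sketched roughly as follows. This is an illustration, not the original code: scikit-learn's Random Forest stands in for whichever model was trained per station, the data is synthetic, and the column names and the `params_by_station` settings are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with three toy stations (1, 2, 3).
rng = np.random.default_rng(0)
n = 300
train = pd.DataFrame({
    "station_nbr": rng.integers(1, 4, size=n),
    "tavg": rng.normal(50, 10, size=n),
    "preciptotal": rng.exponential(0.3, size=n),
})
train["units"] = rng.poisson(2, size=n)
features = ["tavg", "preciptotal"]

# Hypothetical per-station settings; in practice each would be tuned separately.
params_by_station = {1: {"max_depth": 4}, 2: {"max_depth": 6}, 3: {"max_depth": 8}}

# Train one model per station, using only that station's rows.
models = {}
for station, part in train.groupby("station_nbr"):
    model = RandomForestRegressor(n_estimators=20, random_state=0,
                                  **params_by_station[station])
    model.fit(part[features], part["units"])
    models[station] = model

def predict_by_station(df):
    """Route each row to the model trained for its station."""
    preds = pd.Series(np.nan, index=df.index)
    for station, part in df.groupby("station_nbr"):
        preds.loc[part.index] = models[station].predict(part[features])
    return preds

preds = predict_by_station(train)
```

Keeping the parameters in a dictionary keyed by station makes it easy to tune each subset on its own, which is what closed the gap to the single global model.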
Of the models I tested, two stood out: Gradient Boosted Trees (XGBoost) and Random Forests.
I had used Random Forest for regression only once, in a job, but never with a large amount of data. After tuning the parameters and applying the model to the imputed data, I got an RMSLE of 0.1166 on the LB.
XGBoost gave the lowest error of any individual model. In addition to adjusting parameters such as the depth of the trees, using a subset of the data, it was necessary to adjust the number of trees and the learning rate. Usually a small learning rate with a lot of trees is the safe recipe to improve performance, in exchange for more time to train the model.
Since XGBoost is sensitive to the seed of the RNG, I decided to make an ensemble of XGBoost models by changing only this value. This method marginally improved my score in other competitions, but here its impact was greater for the following reason: XGBoost has a function that lets you hold out part of the data so that it can determine the number of trees that minimizes the error. I decided to use this function and held out 5% of the data. In addition to varying the seed of XGBoost itself, I varied the seed used to split the data, which made the models more diverse, something essential for a good ensemble.
The most I managed to train was 10 XGBoost models in this scheme. Although the single model was already very stable, the RMSLE of the ensemble was 0.1041, a reduction compared to the 0.1095 of the single model.
In the end, I put together all the solutions that I had submitted and ended up with an RMSLE of 0.1028, securing a position among the top 20%.
One day after the end of the competition I reviewed the variables of my XGBoost model and found that the variables that identified the products (item_nbr) were not in binary format, and that the model considered them the most important. With the correct coding I believe it would have been possible to reduce the error further and achieve a better final position, although trees are very good at capturing patterns even with categorical variables in this format.