The Otto Group is one of the largest e-commerce companies in the world.
According to the company, the diversity of its global infrastructure means that many similar products end up classified into incorrect categories. So they released data on approximately 200 thousand products belonging to nine categories. The objective was to create a probabilistic model that classified each product into the correct category.
For training, about 60 thousand products were available and, for testing, about 150 thousand.
The 93 features were anonymized; the only information given about them is that they were numerical. This makes it harder to reason about the best way to work with the data, and also harder to engineer new features.
In addition, the task involves multiple classes, that is, we have to generate probabilities for 9 classes.
The metric chosen for this competition was the multi-class log loss.
This metric heavily punishes confident predictions for the wrong class, and it is sensitive to the imbalance between classes: some classes represented about 20% of the training data, others around 9%. Because the metric evaluates the predicted probabilities directly, trying to balance the classes, whether by subsampling or by reweighting the smaller ones, does not help.
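To make the metric concrete, here is a small sketch (with made-up probabilities) showing how a single confident wrong prediction dominates the multi-class log loss:

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example: 4 samples, 3 classes (the competition used 9).
y_true = [0, 1, 2, 1]
# Predicted probabilities, one row per sample.
y_pred = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.20, 0.20, 0.60],
    [0.90, 0.05, 0.05],  # confident and wrong: -log(0.05) ~ 3.0
])
# Log loss is the mean of -log(probability assigned to the true class).
print(log_loss(y_true, y_pred))
```

The last row alone contributes roughly three times as much as the other three samples combined, which is why overconfident models are punished so hard.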
Even with anonymous variables, I decided to explore the possibility of integrating interactions between them, which might not be captured by the models.
I tried to create new features based on sums, differences, ratios, and products of the originals. The only feature that contributed significantly was the sum of all attributes for a given product. Most attributes were quite sparse (more zeros than any other value), which may explain the lack of success.
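A minimal sketch of this kind of feature engineering, on toy data with hypothetical column names (the real dataset had 93 anonymized columns):

```python
import pandas as pd

# Toy stand-in for the anonymized, mostly-zero features.
X = pd.DataFrame({"feat_1": [0, 2, 0, 5],
                  "feat_2": [1, 0, 0, 3],
                  "feat_3": [0, 4, 1, 0]})

# Pairwise interactions between two columns.
X["feat_1_plus_2"]  = X["feat_1"] + X["feat_2"]
X["feat_1_minus_2"] = X["feat_1"] - X["feat_2"]
X["feat_1_times_2"] = X["feat_1"] * X["feat_2"]
# Ratios need care because of the many zeros in sparse data.
X["feat_1_over_2"]  = X["feat_1"] / (X["feat_2"] + 1)

# The one engineered feature that helped: the row-wise sum.
X["row_sum"] = X[["feat_1", "feat_2", "feat_3"]].sum(axis=1)
print(X["row_sum"].tolist())  # → [1, 6, 1, 8]
```

With 93 columns there are thousands of such pairs, which is exactly the explosion the next section tries to tame.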
Using a GBM / RF to Select Features
As we are talking about a gigantic space of possible attribute combinations, I decided to use a technique applied by a participant in another competition and described at this link: http://trevorstephens.com/post/98233123324/armchair-particle-physicist
Basically it consists of creating some datasets with the desired interactions and training a model of Gradient Boosted Trees in each one. After that, we check which are the most important features in each dataset. The logic is: if a feature is important in several different datasets, it should be important overall.
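A miniature sketch of the idea, using scikit-learn's GradientBoostingClassifier on synthetic data: each candidate pairwise-product interaction is appended to the data, a GBM is trained on stratified splits, and the importance the model assigns to the interaction column is averaged. Interactions that score highly across splits would be the candidates to keep.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)

# Score each pairwise product by the importance the GBM assigns to it,
# averaged over stratified splits.
scores = {}
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for i, j in combinations(range(X.shape[1]), 2):
    X_aug = np.column_stack([X, X[:, i] * X[:, j]])
    imp = 0.0
    for train_idx, _ in skf.split(X_aug, y):
        gbm = GradientBoostingClassifier(n_estimators=30, random_state=0)
        gbm.fit(X_aug[train_idx], y[train_idx])
        imp += gbm.feature_importances_[-1]  # the interaction column
    scores[(i, j)] = imp / skf.get_n_splits()

best = max(scores, key=scores.get)
print("most promising interaction:", best)
```

On the real data one would sample a subset of pairs rather than enumerating all of them, since 93 features give over 4000 pairs.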
I tried it using 2 stratified splits, and although they agreed on the importance of some interactions, those interactions did not improve my base model. As I had already spent days on this part, I decided to focus on fitting models for an ensemble instead.
If I had continued, perhaps selecting the X best interactions and re-adjusting the hyperparameters of some model could extract value from these variables. Or cause terrible overfitting.
Gradient Boosted Trees (XGBoost)
My individual model with the lowest log loss was built with XGBoost, a fast, parallel implementation of the powerful Gradient Boosted Trees. This model has appeared in the winning solutions of many competitions and usually delivers superior performance.
To find the best combination of hyperparameters I used a random search. I fixed the learning rate at 0.05 and varied parameters that control overfitting, such as the tree depth, the minimum number of examples required in a node, and the proportions of examples and features the algorithm samples to build each tree.
Usually, the lower the learning rate, the better the score, but more trees are needed, which increases training time. In the end I set it to 0.01 and trained 2000 trees.
After finding the best parameters, the cross-validated log loss over 3 folds was 0.4656, and on the leaderboard, 0.4368.
Calibrating a Random Forest
I didn't have initial success with Random Forests in this competition, but one suggestion given in the forum was to calibrate their predictions. The scikit-learn team had recently released a new tool that adjusts a model's output values so that they become closer to the true probabilities.
A Random Forest usually takes a vote among its trees, and the proportion of votes for each class is reported as the probability of the example belonging to that class. These proportions do not necessarily match the actual probabilities of the events, so calibration is needed to turn them into true probability estimates.
As this competition asked us to predict the likelihood of a product belonging to a category, it was very important to have the probabilities adjusted correctly. In this case, I used the new scikit-learn tool with 5 data splits.
This greatly improved the Random Forest's predictions, and although Random Forests also use decision trees, making them similar to the GBM, they ended up contributing significantly to the ensemble.
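A sketch of this calibration, assuming the scikit-learn tool in question is `CalibratedClassifierCV` (which wraps a model and calibrates it with cross-validated splits, here 5 as in the text):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Raw Random Forest: vote proportions used directly as probabilities.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_tr, y_tr)

# Calibrated version: 5 internal splits, isotonic regression on the outputs.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)

print("raw RF log loss:       ", log_loss(y_val, rf.predict_proba(X_val)))
print("calibrated RF log loss:", log_loss(y_val, calibrated.predict_proba(X_val)))
```

On the competition data, the calibrated probabilities scored noticeably better under log loss than the raw vote proportions.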
Neural Networks (Lasagne and NoLearn)
I had the opportunity to learn two great Python modules for training neural networks in a simple way: Lasagne and NoLearn. With them, it is simple to create and train state-of-the-art neural networks, including using GPUs for processing.
But make no mistake: although they simplify the implementation, a neural network requires many decisions to become useful. We have to choose the architecture, the weight-initialization scheme, and the regularization. Some researchers suggest doing a random search over the initial hyperparameters and then continuing to adjust manually.
In this case, the architecture that served me well had 3 hidden layers with 768, 512, and 256 units, respectively. I trained two versions: one with a dropout of 0.5 between the hidden layers, and another with a dropout of 0.5 only after the first hidden layer. The hidden units were ReLUs, and the output layer was a softmax.
An interesting interpretation of another competitor who came to a similar architecture is that the first hidden layer allows the network to make random projections of the data, and the smaller layers that come after it, seek a synthesis of the information. Neural networks are naturally difficult to interpret, but I found this a very interesting point of view.
Finally, since a neural network converges to a local minimum that depends on the initial weights (and we have many initial weights), I decided to average 5 neural networks trained from different initializations. This average gave me a score of 0.4390 on the leaderboard, comparable to XGBoost.
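A miniature sketch of this averaging, using scikit-learn's `MLPClassifier` as a stand-in (the text used Lasagne/NoLearn with dropout, which MLPClassifier does not offer; the layer sizes here are also scaled down, the point is only the seed-averaging):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Train the same architecture from 5 different random initializations
# and average the predicted probabilities.
probas = []
for seed in range(5):
    net = MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=300,
                        random_state=seed)
    net.fit(X_tr, y_tr)
    probas.append(net.predict_proba(X_val))

avg = np.mean(probas, axis=0)
print("averaged log loss:", log_loss(y_val, avg))
```

Averaging over seeds smooths out the variance from the different local minima each run lands in.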
I still tried to train: SVM, Nearest Neighbors, Logistic Regression, but none of them performed well or contributed significantly to the ensemble.
Adjusting the Hyperparameters
Since we did not have access to the meaning of the variables, tuning the hyperparameters was vital to getting the best models. At first I did a random search over ranges that were reasonable but still allowed the models to explore the space a bit.
Usually the hyperparameters of a model are not independent, so adjusting one at a time will not find the best combination; it is important to try combinations of parameters jointly. Testing every possible combination is usually infeasible, but a random search gives us an idea of the solution space without having to explore it exhaustively.
Once this exploration is done, it is good to manually vary some of them, and see if it is possible to improve performance on some neighboring combination.
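This two-step process, random search over joint combinations, then manual refinement around the best one, can be sketched with scikit-learn's `RandomizedSearchCV` (model and ranges here are illustrative):

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=400, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)

# Sample joint combinations instead of tuning one parameter at a time.
param_dist = {
    "max_depth": randint(3, 15),
    "min_samples_leaf": randint(1, 10),
    "max_features": uniform(0.2, 0.8),
}
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_dist, n_iter=10, scoring="neg_log_loss", cv=3, random_state=0)
search.fit(X, y)

print("best params:", search.best_params_)
# Next step: vary these values manually around the best combination
# and check whether a neighboring setting improves the score.
```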
In the end my solution was a weighted average of the following models:
– Gradient Boosted Trees (XGBoost) on the original variables and the sum of the columns.
– Random Forest on the original variables, log(X + 1) of them, and the sum of the columns.
– Neural network with 3 hidden layers and dropout of 0.5 in each, on the original variables, log(X + 1) of them, and the sum of the columns.
– Neural network with 3 hidden layers and dropout of 0.5 only in the first hidden layer, on the original variables, log(X + 1) of them, and the sum of the columns.
In addition to these models, I added a bias: simply predicting each class's proportion in the training data as the probability of an example belonging to it.
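A toy sketch of the weighted average plus the class-proportion bias (all probabilities, priors, and weights below are made up; in practice the weights would be tuned on a validation set):

```python
import numpy as np

# Toy probability matrices from three hypothetical models (4 samples, 3 classes).
p_xgb = np.array([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.5, 0.4, 0.1]])
p_rf  = np.array([[0.5, 0.4, 0.1], [0.3, 0.6, 0.1], [0.2, 0.2, 0.6], [0.4, 0.4, 0.2]])
p_nn  = np.array([[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6], [0.6, 0.3, 0.1]])

# The "bias": the training-set class proportions, used as a constant prediction.
class_prior = np.array([0.4, 0.35, 0.25])
p_bias = np.tile(class_prior, (len(p_xgb), 1))

# Hypothetical weights for the weighted average.
w = {"xgb": 0.40, "rf": 0.20, "nn": 0.35, "bias": 0.05}
ensemble = (w["xgb"] * p_xgb + w["rf"] * p_rf
            + w["nn"] * p_nn + w["bias"] * p_bias)
ensemble /= ensemble.sum(axis=1, keepdims=True)  # renormalize rows

print(ensemble.round(3))
```

Blending in the prior nudges overconfident predictions back toward the base rates, which helps under log loss.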
This solution achieved a log loss of 0.4080 and secured me 35th place among 3514 teams (Top 1%).
The winning team’s solution consisted of a three-layered ensemble:
In the first layer 33 different models were trained, varying both the type of model and the variables used to train each one. Some were multiple models trained with bagging.
In the second layer, the predictions of these models were used to feed an XGBoost, a neural network, and an ExtraTrees with Adaboost.
In the third and final layer, they bagged the three second-layer models (totaling 1100 models) and then took a weighted average of them.
In addition, new variables were created, based on the distance from each example to the closest examples of each class, as well as others based on TF-IDF and t-SNE transformations of the original dataset.
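The layered stacking described above can be sketched in miniature with scikit-learn (two stand-in base models and a logistic regression as the meta-model; the winners used 33 base models and far heavier second-layer models):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Layer 1: out-of-fold predictions from the base models become features,
# which avoids leaking each model's training labels into the next layer.
base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               KNeighborsClassifier(n_neighbors=10)]
meta_tr = np.hstack([cross_val_predict(m, X_tr, y_tr, cv=3,
                                       method="predict_proba")
                     for m in base_models])
meta_te = np.hstack([m.fit(X_tr, y_tr).predict_proba(X_te)
                     for m in base_models])

# Layer 2: a meta-model trained on the layer-1 predictions.
meta = LogisticRegression(max_iter=1000)
meta.fit(meta_tr, y_tr)
print("stacked accuracy:", meta.score(meta_te, y_te))
```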