Telstra, Australia’s largest telecommunications and media company, sponsored a contest on Kaggle as part of a recruiting process: data scientists who stood out could join its Big Data team.
Service log data from their network were made available, and the task was to create a model that could predict the severity of a failure at a particular location and time, using the data from these reports.
In total, 974 competitors from around the world participated, and winning this competition put me in 12th place (out of 50,000+ users) in the global Kaggle ranking.
About the Data
There were about 7,400 training examples and 11,200 test examples. This is a small dataset, which requires special attention during model creation and validation.
Each example had an ID, which represented a pair of event location and time. In addition, other files were provided that contained, for each example: signals obtained by the existing monitoring system when processing the logs, the type of event, and the severity of the messages it sent at that time.
The severity of the problem had three categories: 0, meaning there were no problems; 1, meaning some problems were detected in the network; and 2, representing many reported problems. It was therefore a multi-class classification task.
Although the data have temporal and spatial dependence (they come from different locations at different points in time), the organizers split them randomly. The location was known, but not the time when each record was obtained.
So simple cross-validation was the best environment I found to test my model. Cross-validation is simply the procedure of dividing the data into K parts and running K training cycles, each time training on K – 1 parts and evaluating on the part left out.
In this case I used 5 splits, which gave a fairly reliable estimate of performance on the unseen test data revealed after the competition ended.
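A 5-fold cross-validation like this can be sketched with scikit-learn. The data and estimator below are toy stand-ins; the actual features and model are not the ones shown here:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

# Toy stand-ins for the real feature matrix and severity labels (0, 1, 2)
rng = np.random.RandomState(0)
X = rng.rand(500, 10)
y = rng.randint(0, 3, size=500)

# 5 splits, as in the competition setup; stratification keeps the
# class proportions similar across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, valid_idx in cv.split(X, y):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[valid_idx])
    scores.append(log_loss(y[valid_idx], proba))

print(f"mean CV log loss: {np.mean(scores):.4f}")
```

The mean of the 5 fold scores is the estimate of performance on unseen data.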
In total I had 242 variables.
There was a pattern in the data, which would later be called the “magic variable” in the forums, which allowed us to exploit what seemed to be a mistake in data preparation to achieve a good improvement in model accuracy.
Although the training and test files had random IDs, the ordering of the records in the other files had predictive value. Because of this, you could use the IDs of those files directly, or create variables derived from them, to exploit the leak.
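The exact construction of the leak feature is not spelled out here, but a hypothetical sketch of the idea, recording the position at which each ID first appears in an auxiliary file, could look like this (the data and column names are made up):

```python
import pandas as pd

# Hypothetical auxiliary file whose row order carries signal: IDs appear
# in an order correlated with time, even though the train/test files
# themselves were shuffled
event_type = pd.DataFrame({
    "id": [7, 3, 9, 3, 1, 4],
    "event_type": ["event 11", "event 15", "event 11",
                   "event 20", "event 35", "event 15"],
})

# Record the position of each id's first appearance as a feature
order = (event_type.drop_duplicates("id")
                   .reset_index(drop=True)
                   .reset_index()
                   .rename(columns={"index": "file_order"})[["id", "file_order"]])
print(order)
```

The `file_order` column can then be merged back onto the training data as a feature.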
Some people criticized the best solutions for using this error to get a better score, but I think being able to find this type of error in data is an important skill. The only thing that changes outside a competition is that the error should be corrected rather than exploited.
Therefore, it is always important to look for errors in the data. There are several types of information leaks, some very subtle, that can make your model look better than it actually is.
There was a variable indicating the location of each record. In theory this is a categorical variable, meaning it has no natural order, but I still used it as an ordinal instead of creating a column for each value, for two reasons: trees can capture patterns in categorical variables even in this format (although in some cases one-hot encoding is better), and there could be a proximity pattern between sites.
For example, locations 201 and 202 may have been under the same weather conditions, which could influence the occurrence of problems at both stations, so perhaps there was an implicit cluster there.
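A minimal sketch of this ordinal treatment, assuming the locations arrive as strings of the form "location N" (the exact format in the files may differ):

```python
import pandas as pd

# Locations as strings; treating them as ordinal numbers preserves any
# implicit proximity between nearby location ids
train = pd.DataFrame({"location": ["location 201", "location 202", "location 5"]})

# Extract the numeric part and use it directly as an ordinal feature
train["location_num"] = (train["location"]
                         .str.extract(r"(\d+)", expand=False)
                         .astype(int))
print(train)
```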
The service logs had what seemed to be a line for each warning signal, and a value indicating how many times the warning was triggered at that time. The organizers did not make it clear whether this was the meaning, but it is the most plausible explanation.
With this, I created a column for each of these signals, and the value for each example was the number of warnings.
These columns were quite sparse (more zeros than other values), so most of them ended up being useless, but some had predictive value.
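This reshaping, from one row per (ID, signal) pair to one column per signal, can be sketched with pandas. The file and column names below are illustrative, not the competition's exact schema:

```python
import pandas as pd

# Hypothetical log file: one row per (id, signal), with the number of
# times that warning fired for that id
log_feature = pd.DataFrame({
    "id":      [1, 1, 2, 3, 3, 3],
    "feature": ["f1", "f2", "f1", "f2", "f3", "f1"],
    "volume":  [3, 1, 7, 2, 2, 5],
})

# One column per signal; signals absent for an id become 0, which makes
# the resulting matrix sparse (mostly zeros)
wide = (log_feature.pivot_table(index="id", columns="feature",
                                values="volume", fill_value=0)
                   .reset_index())
print(wide)
```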
After I was satisfied with the performance of my best individual model, I decided to create an ensemble: several models that capture different patterns in the data and complement each other.
Among the 15 models that composed the final solution, two stand out: Gradient Boosted Trees (in the implementation of XGBoost) and Neural Networks (using Keras).
Gradient Boosted Trees – XGBoost
Anyone who is used to Kaggle knows that most of the competition solutions involve Boosted Trees. This model is quite powerful by itself, but the parallel implementation in the XGBoost package has proven to be very good at solving machine learning tasks in general.
In this case, my best “individual” model was one of these and would have finished in 18th place. There was still a lot that could be improved on it, but with only a week left I knew that winning without an ensemble would be much harder, so I stopped optimizing the single model at this point.
Neural Networks – Keras
A model that finds good solutions, but using a different strategy than decision trees, is a neural network. Although it did not demonstrate a performance comparable to XGBoost, it was a great addition to it in the ensemble.
I used the Keras library, which is fantastic for creating neural networks in Python.
I tried normalizing the data with all the scalers in scikit-learn, and the best was StandardScaler, which subtracts the mean and divides by the standard deviation. In addition, I used two hidden layers and the Adam optimizer. To find a good architecture, dropout rate, and parameters in general, I ran a random search.
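A sketch of such a network with Keras; the layer sizes and dropout rate below are placeholders, not the values found by the random search:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

# Toy data; the real features were standardized the same way
rng = np.random.RandomState(0)
X = rng.rand(300, 20)
y = rng.randint(0, 3, size=300)

# Subtract the mean, divide by the standard deviation
X_std = StandardScaler().fit_transform(X)

# Two hidden layers with dropout and the Adam optimizer; sizes and
# dropout rate are placeholders for the values found by random search
model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(3, activation="softmax"),  # 3 severity classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X_std, y, epochs=2, batch_size=32, verbose=0)
proba = model.predict(X_std, verbose=0)
```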
Neural networks seem pretty sensitive to “everything,” so I believe there are different combinations of architecture, data normalization, and variable encoding that would give equal or better results.
One of the reasons it did not perform better was that I did not create a specific feature set for the neural network; I used the same set of variables as the trees. If this had been my main model, I would have done it differently, but since it was just a component of the ensemble, this was enough.
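Blending ensemble members can be as simple as a weighted average of their predicted class probabilities; the weights and numbers below are made up for illustration (in practice the weights would be chosen on the cross-validation predictions):

```python
import numpy as np

# Hypothetical predicted class probabilities from two models, for the
# same 4 validation examples and 3 severity classes
p_xgb = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6],
                  [0.5, 0.4, 0.1]])
p_nn  = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.3, 0.6],
                  [0.4, 0.4, 0.2]])

# Weighted average; the weight here is made up for illustration
w = 0.7
p_blend = w * p_xgb + (1 - w) * p_nn
print(p_blend)
```

Because each input row sums to 1, the blended rows remain valid probability distributions.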
This was a brief summary of 3 weeks of work in this competition.
My initial plan was to create just one model and see how far I could go, but seeing that the model was good, I decided to try to reach my first individual victory.