

How to Win the Machine Learning Competition of Australia’s Largest Telecommunications Company in Just 19 Days

Telstra, Australia’s largest telecommunications and media company, sponsored a contest on Kaggle as part of a recruiting process, looking for data scientists who stood out to join its Big Data team.

Service log data from their network were made available, and the task was to create a model that could predict the severity of a failure at a particular location and time, using the data from these reports.

In total, 974 competitors from around the world participated, and this victory put me in 12th place (of 50,000+ users) in the global Kaggle ranking.

About the Data

There were about 7,400 training examples and 11,200 test examples. This is a small amount of data, which requires special attention when creating and validating the model.

Each example had an ID representing a pair of event location and time. In addition, other files were provided containing, for each example: information on the signals obtained by the current severity monitoring system when processing the logs, the type of event, and the severity of the messages it sent at that time.

The severity of the problem had three categories: 0, meaning there were no problems; 1, meaning some problems were detected in the network; and 2, representing many reported problems. It was therefore a multi-class classification task.

Validation

Although there is temporal and spatial dependence in the data, since the records come from different locations at different points in time, the data were split randomly by the organizers. It was possible to know the location, but not the time, at which each record was obtained.

So simple cross-validation was the best setup I found to test my model. Cross-validation is simply the procedure of dividing your data into K parts and running K training cycles, each time training on K − 1 parts and evaluating on the part left out.

In this case I used 5 folds, which provided a fairly reliable estimate of the performance on the unseen test data revealed after the end of the competition.
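
For reference, here is a minimal sketch of this setup with scikit-learn. The data and the model are random stand-ins (the real features come later in this post), and the metric shown is just an illustrative choice:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Random stand-in data with the shapes described above:
# ~7,400 examples, 242 features, severity classes 0/1/2.
rng = np.random.default_rng(0)
X = rng.normal(size=(7400, 242))
y = rng.integers(0, 3, size=7400)

# 5 folds, stratified so each fold keeps the class proportions.
model = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_log_loss")
print(scores.mean(), scores.std())
```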

Feature Engineering

In total I had 242 variables.

“Magic” Variable

There was a pattern in the data, later dubbed the “magic variable” in the forums, that made it possible to exploit what seemed to be a mistake in data preparation and achieve a good improvement in model accuracy.

Although the training and test files had random IDs, the ordering of the records in the other files had predictive value. With this, you could use the IDs of these files themselves, or create variables derived from them, to exploit the leak.
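
To make the idea concrete, here is a sketch of how such an ordering leak can be turned into a feature. The data and column names below are purely illustrative, not the actual competition files:

```python
import pandas as pd

# Illustrative stand-ins: "aux" plays the role of one of the auxiliary
# files, whose on-disk row order carried predictive value.
aux = pd.DataFrame({"id": [11, 7, 3, 15]})
train = pd.DataFrame({"id": [3, 11, 15]})

# Record the row position as a feature and merge it into the training set.
aux["row_order"] = range(len(aux))
train = train.merge(aux, on="id", how="left")
print(train)
```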

Some people criticized the fact that the best solutions used this error to get a better score in the competition, but I think it is important to be able to find this type of error in data; the only thing that changes outside a competition is that the error should be corrected rather than exploited.

Therefore, it is always important to look for errors in the data. There are several types of information leaks, some very subtle, that can make your model look better than it actually is.

Location

There was a variable indicating the location of each record. In theory this is a categorical variable, meaning it has no natural order, but I still used it as an ordinal instead of creating a column for each value, for two reasons: trees can capture patterns in categorical variables even in this format (though in some cases one-hot encoding is better), and there could be some proximity pattern between sites.

For example, locations 201 and 202 may have been under the same weather conditions, which could influence the occurrence of problems at these two stations, so perhaps there was an implicit cluster there.
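
A minimal sketch of this encoding, assuming the raw values look like “location 201” (the exact format is illustrative):

```python
import pandas as pd

train = pd.DataFrame({"location": ["location 201", "location 202", "location 5"]})

# Keep the site as a single ordinal column instead of one-hot encoding it;
# trees can still split on it, and nearby site numbers stay close together.
train["location_num"] = train["location"].str.extract(r"(\d+)", expand=False).astype(int)
print(train)
```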

Log Variables

The service logs had what seemed to be one line per warning signal, with a value indicating how many times the warning was triggered at that time. The organizers did not make it clear whether this was the meaning, but it is the most plausible explanation.

With this, I created a column for each of these signals, and the value for each example was the number of warnings.

These columns were quite sparse (more zeros than other values), so most of them ended up being useless, but some had predictive value.
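
A sketch of this pivot with pandas, using illustrative column names:

```python
import pandas as pd

# Each row of the log file is one (example, signal) pair with a count.
log = pd.DataFrame({
    "id":     [1, 1, 2, 2, 2],
    "signal": ["f_203", "f_82", "f_203", "f_170", "f_82"],
    "volume": [7, 1, 2, 4, 1],
})

# One column per signal; missing signals become zero, hence the sparsity.
log_wide = log.pivot_table(index="id", columns="signal",
                           values="volume", aggfunc="sum", fill_value=0)
print(log_wide)
```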

Final Solution

After being satisfied with the performance of my best individual model, I decided to create an ensemble: several models that capture different patterns in the data and complement each other.

Among the 15 models that made up the final solution, two stand out: Gradient Boosted Trees (the XGBoost implementation) and Neural Networks (using Keras).

Gradient Boosted Trees – XGBoost

Anyone who is used to Kaggle knows that most competition solutions involve Boosted Trees. This model is quite powerful by itself, and the parallel implementation in the XGBoost package has proven very good at solving machine learning tasks in general.

In this case, my best “individual” model was one of these, and on its own it would have finished in 18th place. There was still a lot that could be improved, but with only a week to go I knew that winning without an ensemble would be much harder, so I stopped optimizing the single model at that point.
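
For illustration, a minimal multi-class XGBoost setup for this kind of problem; the data and hyperparameters are placeholders, not the configuration used in the competition:

```python
import numpy as np
import xgboost as xgb

# Random stand-in data with the shapes described earlier.
rng = np.random.default_rng(0)
X = rng.normal(size=(7400, 242))
y = rng.integers(0, 3, size=7400)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "multi:softprob",  # output per-class probabilities
    "num_class": 3,
    "eval_metric": "mlogloss",
    "max_depth": 6,
    "eta": 0.05,
}
booster = xgb.train(params, dtrain, num_boost_round=500)
probs = booster.predict(dtrain)  # shape: (n_examples, 3)
```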

Neural Networks

A model that also finds good solutions, but using a different strategy from decision trees, is a neural network. Although its performance was not comparable to XGBoost’s, it was a great addition to the ensemble.

I used the Keras library, which is fantastic for creating neural networks in Python.

I tried normalizing the data with all the scalers available in scikit-learn, and the best was StandardScaler, which subtracts the mean and divides by the standard deviation. I used two hidden layers of neurons and the Adam optimizer. To find a good architecture, dropout rate, and parameters in general, I ran a random search.
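
A sketch of that kind of network in Keras; the layer sizes and dropout rates below are illustrative, since the actual values came out of the random search:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two hidden layers with dropout and a softmax over the 3 severity
# classes; inputs are assumed to be StandardScaler-normalized.
model = keras.Sequential([
    layers.Input(shape=(242,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```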

Neural networks seem pretty sensitive to “everything,” so I believe there are different combinations of architecture, data normalization, and variable encoding that would give equal or better results.

One of the reasons it did not perform better was that I did not create a feature set specific to the neural network; I used the same variables as the trees. If this had been my main model, I would have done it differently, but since it was just one component of the ensemble, this was enough.
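
To make the ensembling step concrete, here is a minimal blending sketch; the weights are illustrative and would normally be tuned on the validation folds:

```python
import numpy as np

# Stand-ins for the per-class probabilities of two ensemble members.
rng = np.random.default_rng(0)
p_xgb = rng.dirichlet(np.ones(3), size=5)
p_nn = rng.dirichlet(np.ones(3), size=5)

# Weighted average of class probabilities, then pick the top class.
blend = 0.7 * p_xgb + 0.3 * p_nn
pred = blend.argmax(axis=1)
print(pred)
```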

Conclusion

This was a brief summary of 3 weeks of work in this competition.

My initial plan was to create just one model and see how far I could go, but seeing that the model was good, I decided to try to reach my first individual victory.

Using Machine Learning to Predict the First Destination of 60,000 AirBnB Users

AirBnB is a technology company that offers a virtual environment where users can book accommodation in various places around the world and also advertise their own properties to the community of travelers.

Looking for candidates to join their team of data scientists, they decided to sponsor a competition whose goal was to predict the first country in which a new user would make a reservation.

The most interesting part of this competition, for me, was that it was possible to reach a good position (top 7%) with a single model, which could be put into production right away.

About the Data

Anonymized data about the user profile, such as age, gender, account creation date, language, and the channel used to reach the site, were made available, with records going back to 2010.

In addition, there were data about user sessions, identified by user ID, describing the action performed (a click, a message to the property owner, a view of search results, etc.) and how many seconds passed between that action and the previous one. These data existed only from January 2014 onward.

There were two other files, with data on the characteristics of the countries, but I found no use for them.

The metric used was NDCG (Normalized Discounted Cumulative Gain). It ranges from 0 to 1 and measures the relevance of the model’s results sorted by their predicted scores.
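
Since each user has a single true destination, NDCG reduces to a simple form. A sketch, assuming the k=5 cutoff this competition is known to have used:

```python
import numpy as np

def ndcg_at_k(ranked_predictions, truth, k=5):
    """NDCG for a single relevant item: the ideal DCG is 1, so the
    score is just the discount at the position where the truth appears."""
    for i, pred in enumerate(ranked_predictions[:k]):
        if pred == truth:
            return 1.0 / np.log2(i + 2)  # i is 0-based, so rank = i + 1
    return 0.0

print(ndcg_at_k(["US", "NDF", "FR"], "NDF"))  # 1/log2(3) ~ 0.63
```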

A user could choose among 12 countries, but almost 90% of them either went to the United States or did not make a reservation.

It soon became clear, given the metric used, that it would be better to focus on modeling whether or not the user makes a reservation at all, leaving the specific destination in the background.

Validation

The data provided for training and validation referred to users who created their accounts before July 1, 2014, and the leaderboard test data covered the three months after this date.

Several participants used cross-validation with randomly shuffled examples, but because of the time needed to run the model during development, and because the characteristics of the data depended on time, I decided to use the May and June 2014 data as validation.

Some characteristics of the data, such as the proportion of users who had session data, changed over time, so I chose a validation period quite close to the test period. This choice proved quite robust.
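
A sketch of this time-based split; the column name matches the kind of profile data described above, but the rest is illustrative:

```python
import pandas as pd

users = pd.DataFrame({"date_account_created": pd.to_datetime(
    ["2013-11-02", "2014-03-30", "2014-05-10", "2014-06-21"])})

# Accounts created in May-June 2014 form the validation set,
# mirroring the three-month test period that follows July 1, 2014.
cutoff = pd.Timestamp("2014-05-01")
train = users[users["date_account_created"] < cutoff]
valid = users[users["date_account_created"] >= cutoff]
```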

Variables

User Profile

I used the standard user profile variables already described above. Since I planned to use a model based on decision trees, I encoded each categorical variable as an ordinal, as these models are able to find patterns even when the variables have no real order.

In addition, I computed the amount of time each user spent between sessions, and how many different types of actions they performed on the site, as shown in the sketch below.
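
A sketch of both steps (ordinal encoding and session summaries), with illustrative column names:

```python
import pandas as pd

# Ordinal-encode a categorical profile column for tree-based models.
profile = pd.DataFrame({"gender": ["male", "female", "unknown", "female"]})
profile["gender_ord"] = pd.factorize(profile["gender"])[0]

# Per-user session summaries: total time between actions and the
# number of distinct action types performed.
sessions = pd.DataFrame({
    "user_id":      [1, 1, 2, 2, 2],
    "action":       ["search", "click", "search", "message", "click"],
    "secs_elapsed": [30.0, 5.0, 12.0, 48.0, 7.0],
})
agg = sessions.groupby("user_id").agg(
    total_secs=("secs_elapsed", "sum"),
    n_action_types=("action", "nunique"),
)
print(agg)
```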

Dates

I extracted basic date information, creating individual columns for the day, month, year, and day of the week. This helps the model capture seasonal effects in the time series.
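
A sketch of that decomposition, using the account creation date as an example:

```python
import pandas as pd

df = pd.DataFrame({"date_account_created": pd.to_datetime(
    ["2014-01-15", "2014-06-03"])})

# One column per date component, so trees can split on each separately.
d = df["date_account_created"]
df["day"] = d.dt.day
df["month"] = d.dt.month
df["year"] = d.dt.year
df["weekday"] = d.dt.dayofweek  # 0 = Monday
print(df)
```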

User Behavior

This is an essential part of most e-commerce models. In this specific case, I created a column with the total amount of time the user spent on each action performed on the site, and I did the same to count how many times the user executed each particular action.
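
A sketch of these two aggregations, with illustrative column names:

```python
import pandas as pd

sessions = pd.DataFrame({
    "user_id":      [1, 1, 2, 2],
    "action":       ["search", "click", "search", "search"],
    "secs_elapsed": [30.0, 5.0, 12.0, 48.0],
})

# Total time per action type and number of occurrences per action type,
# one column per action.
time_per_action = sessions.pivot_table(index="user_id", columns="action",
                                       values="secs_elapsed",
                                       aggfunc="sum", fill_value=0)
count_per_action = sessions.pivot_table(index="user_id", columns="action",
                                        values="secs_elapsed",
                                        aggfunc="count", fill_value=0)
features = time_per_action.join(count_per_action,
                                lsuffix="_secs", rsuffix="_count")
print(features)
```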

Machine Learning Model

I used a Random Forest with 300 trees, which was good enough to win a place in the top 7% (91st of 1,463).

In total, I had about 1,100 columns, most of them quite sparse (most of the values were zero). Some people say that models based on decision trees do not do well with this data format, but my experience suggests that it is a matter of finding the right parameters.
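
For reference, a minimal version of this model; everything except the tree count is an illustrative default, and the data are random stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Mostly-zero stand-in data mimicking the ~1,100 sparse columns.
rng = np.random.default_rng(0)
X = (rng.random((1000, 1100)) < 0.02).astype(float)
y = rng.integers(0, 12, size=1000)  # 12 destination countries

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
rf.fit(X, y)
proba = rf.predict_proba(X)  # rank countries by probability for NDCG
```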

Unfortunately, I did not have time to train a good Gradient Boosted Trees model, which would certainly have performed better; given the small difference in scores between my Random Forest and the top of the leaderboard, it would almost certainly have been enough to finish in the top 10.
