In recent years the insurance industry has been looking for ways to improve its models using machine learning. One way is to go beyond the form completed by the insured and use additional data to estimate the risk of accidents.
One such source is driver behavior data obtained through GPS tracking. The belief is that this data captures information and behavior patterns that traditional methods miss.
What makes one driver different from another? Identifying who is driving a vehicle during a trip is the first step toward a good model.
The insurance company AXA decided to provide anonymized trips from approximately 2700 drivers. For each driver there are 200 trips, some of which were not made by that driver, and the task was to create a model that identifies which trips were incorrectly attributed to the driver.
To avoid information leakage, each trip was centered and rotated by a random angle, and portions were removed from the beginning and the end of the trip.
In total, 200 trips for each of 2736 drivers were made available. Each driver's folder contained a random number of trips falsely attributed to that driver. An important piece of information is that we were assured that most trips in a folder truly belonged to the driver.
Each trip was described in a CSV file with numerical x and y positions, given as distances from the origin in meters, and each line corresponded to one second of the trip.
The metric used for evaluation was the area under the ROC curve, abbreviated as AUC.
The features I used are based primarily on the first three derivatives of the displacement: speed, acceleration, and jerk. Adding other derivatives hurt performance.
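A minimal sketch of how these three derivatives can be obtained from a trip, assuming one (x, y) position per second as described above. The function name `motion_features` and the choice to use magnitudes are illustrative assumptions, not the original competition code.

```python
import numpy as np

def motion_features(xy):
    """xy: array of shape (n, 2) with positions in meters, one row per second."""
    # With a 1-second sampling interval, a first difference is a derivative.
    velocity = np.diff(xy, axis=0)            # m/s, shape (n-1, 2)
    acceleration = np.diff(velocity, axis=0)  # m/s^2, shape (n-2, 2)
    jerk = np.diff(acceleration, axis=0)      # m/s^3, shape (n-3, 2)
    # Magnitudes do not change when a trip is rotated around the origin.
    speed = np.linalg.norm(velocity, axis=1)
    accel_mag = np.linalg.norm(acceleration, axis=1)
    jerk_mag = np.linalg.norm(jerk, axis=1)
    return speed, accel_mag, jerk_mag

# Toy trip: constant 10 m/s along the x axis for 5 seconds.
xy = np.column_stack([np.arange(0.0, 60.0, 10.0), np.zeros(6)])
speed, accel_mag, jerk_mag = motion_features(xy)
```

Because the trips were randomly rotated, magnitudes are a convenient choice here: they stay the same regardless of the rotation applied to the trip.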
Frequencies – FFT
I tried to extract frequencies from both the raw data and the derivatives. I did this with the FFT (Fast Fourier Transform) function available in NumPy. I divided the frequency range into 20 buckets, and the value of each feature was the sum of the frequency values in its bucket.
In this case I extracted the frequencies separately for each component (x and y). Getting the frequencies from speed and acceleration, with the two components combined, might have been better, but I did not test this hypothesis.
The maximum AUC I achieved, using frequencies of the three derivatives, was 0.7058, with a logistic regression without parameter optimization.
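The bucketed frequency features can be sketched as follows. This is an illustration under assumptions: the text does not say whether magnitudes or some other quantity were summed, so the sketch sums the magnitudes of the real-input FFT, and the name `fft_bucket_features` is hypothetical.

```python
import numpy as np

def fft_bucket_features(signal, n_buckets=20):
    """Sum FFT magnitudes inside n_buckets contiguous frequency ranges."""
    # rfft returns only the non-negative frequencies of a real signal.
    spectrum = np.abs(np.fft.rfft(signal))
    # array_split tolerates lengths that are not a multiple of n_buckets.
    buckets = np.array_split(spectrum, n_buckets)
    return np.array([bucket.sum() for bucket in buckets])

# Toy signal: a pure oscillation at 0.1 Hz sampled once per second.
t = np.arange(200)
x = np.sin(2 * np.pi * 0.1 * t)
features = fft_bucket_features(x)
```

Applied separately to the x and y components of the raw data and of each derivative, this gives 20 features per signal.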
This is the transformation that I found in all the academic papers I read on the subject. It consists of building a histogram of each attribute. This introduces a new parameter: the size of the intervals.
The best setting I found was to divide the magnitude of the velocity into 50 intervals, and the acceleration into 10.
Two details improved performance: using the “density” option of the NumPy histogram function, which estimates the probability density function, and normalizing the results so that their sum is 1.
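A small sketch of the histogram features, assuming a fixed value range per attribute so that the bins mean the same thing for every trip. The range used below and the name `histogram_features` are assumptions for illustration.

```python
import numpy as np

def histogram_features(values, n_bins, value_range):
    """Histogram of an attribute, returned as per-bin probabilities."""
    density, edges = np.histogram(
        values, bins=n_bins, range=value_range, density=True
    )
    # density integrates to 1 over the range; multiplying by the bin
    # widths turns it into probabilities that sum to 1.
    # (Assumes at least one value falls inside value_range.)
    return density * np.diff(edges)

rng = np.random.default_rng(0)
speed = rng.uniform(0.0, 30.0, size=500)           # toy speeds in m/s
speed_hist = histogram_features(speed, 50, (0.0, 40.0))
```

Note that `density=True` alone makes the area under the histogram equal to 1, not the sum of the bin values, which is why the sketch multiplies by the bin widths.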
Another frequent suggestion was to use the value of attributes at certain percentiles. This helps you figure out how extreme the values are.
An example may help: to obtain the 5th percentile of velocity, I sorted the values in ascending order and took the point in the list that marks the boundary of the first 5% of the sorted data.
If we have a list of 100 sorted velocity values, indexed from 1 to 100, we take the value at position 5. The value at position 50, for example, would be the median.
This is important because it differentiates drivers with a faster or slower profile.
I used the 5th to 95th percentiles, in steps of 5.
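The percentile features can be sketched with `np.percentile`. One caveat: by default NumPy interpolates linearly between neighboring values, so the result can differ slightly from the "take the value at position 5" description above. The name `percentile_features` is illustrative.

```python
import numpy as np

PERCENTILES = np.arange(5, 100, 5)  # 5, 10, ..., 95

def percentile_features(values):
    """Value of an attribute at the 5th to 95th percentiles, in steps of 5."""
    return np.percentile(values, PERCENTILES)

# Toy example: 100 sorted speed values, 1 to 100.
values = np.arange(1, 101)
features = percentile_features(values)
median = features[9]  # the 50th percentile
```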
One approach that was important in getting the best teams to the top of the leaderboard was to create an algorithm to find similar trips. It was noticeable that some stretches were the same, even though they were rotated to a different angle with respect to the origin.
I created a simple algorithm based on the driver's speed during the trip but, although it found some similar trips, it ended up hurting the score.
The algorithms that succeeded were more complex, and in some cases involved the transformation of variables, reduction of dimensions, and thresholds to determine similarity. Of all of them, two methods seemed essential:
Ramer-Douglas-Peucker: because each trip had a different duration, this method was used to reduce the number of points.
Dynamic Time Warping: this method was used to compare parts of the trip and determine the similarity between them.
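For reference, here are textbook versions of both methods in plain NumPy. How the top teams combined and tuned them is not described here, so this is only a sketch of the two building blocks; `epsilon` is assumed to be non-negative, and the function names are illustrative.

```python
import numpy as np

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker simplification of an (n, 2) polyline."""
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    chord_len = np.hypot(chord[0], chord[1])
    rel = points - start
    if chord_len == 0:
        dists = np.hypot(rel[:, 0], rel[:, 1])
    else:
        # Perpendicular distance to the chord via the 2D cross product.
        dists = np.abs(chord[0] * rel[:, 1] - chord[1] * rel[:, 0]) / chord_len
    idx = int(np.argmax(dists))
    if dists[idx] <= epsilon:
        return np.array([start, end])  # everything in between is dropped
    left = rdp(points[: idx + 1], epsilon)
    right = rdp(points[idx:], epsilon)
    return np.vstack([left[:-1], right])  # avoid repeating the split point

def dtw_distance(a, b):
    """Dynamic Time Warping cost between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return cost[n, m]

straight = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
corner = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
simplified_straight = rdp(straight, 0.1)
simplified_corner = rdp(corner, 0.5)
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

The quadratic cost of this DTW loop is fine for short, simplified sequences, which is one reason to reduce the number of points first.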
Since it was not possible to know for certain which trips belonged to each driver (some trips marked as positive were in fact negative), there was room to decide the best way to sample the data.
At first, with unsupervised methods, I used only the 200 trips of each driver. One suggestion in the competition forum was to use the 200 trips of a driver as positive examples and use trips of other drivers as negative examples.
Overall this was a very good approach. Among the ways of doing it were:
- Randomly pick N drivers and treat all of their trips as negative.
- Randomly choose V trips from N different drivers to use as negative.
The approach that worked best for me was the second option: selecting a small number of trips from several different drivers.
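The second option can be sketched as below. Two things are assumptions for the sake of the example: that V trips are taken from each of the N drivers, and that trips are identified by the numbers 1 to 200 within a driver's folder. The function name is hypothetical.

```python
import numpy as np

def sample_negative_trips(target_driver, all_drivers, n_drivers,
                          trips_per_driver, trips_in_folder=200, rng=None):
    """Return (driver, trip) pairs to use as negatives for target_driver."""
    rng = np.random.default_rng() if rng is None else rng
    others = [d for d in all_drivers if d != target_driver]
    chosen = rng.choice(others, size=n_drivers, replace=False)
    pairs = []
    for driver in chosen:
        # Assumed trip numbering: 1..trips_in_folder inside each folder.
        trip_ids = rng.choice(
            np.arange(1, trips_in_folder + 1),
            size=trips_per_driver,
            replace=False,
        )
        pairs.extend((int(driver), int(trip)) for trip in trip_ids)
    return pairs

drivers = list(range(1, 51))  # toy driver ids
negatives = sample_negative_trips(
    7, drivers, n_drivers=10, trips_per_driver=4,
    rng=np.random.default_rng(0),
)
```

Taking a few trips from many drivers spreads the negative class over many driving styles instead of letting a handful of drivers dominate it.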
Similar / different samples
Another alternative that I tested was to select negative samples from drivers that were more similar to, or more different from, the positive samples, according to a measure like Euclidean distance.
I created attributes such as the average speed, distance traveled, and duration of trips for each driver and determined which drivers were the most similar to, and which were the most different from, the driver in question.
Although the approach using the “most different” drivers showed improvement during cross-validation, it did not improve the score on the test set. The reason, I believe, is that the validation data contains trips that are false but marked as true, whereas the test set reflects the ground truth.
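A sketch of ranking drivers by Euclidean distance between per-driver summaries. Standardizing each attribute before measuring distance is my assumption here (otherwise the attribute with the largest scale dominates); the text does not say whether it was done. The name `rank_drivers_by_distance` and the toy numbers are illustrative.

```python
import numpy as np

def rank_drivers_by_distance(profiles, target_index):
    """profiles: (n_drivers, n_features). Returns other drivers, nearest first."""
    mean = profiles.mean(axis=0)
    std = profiles.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant attributes
    z = (profiles - mean) / std
    dists = np.linalg.norm(z - z[target_index], axis=1)
    order = np.argsort(dists)
    return order[order != target_index]

# Toy profiles: average speed (m/s), distance (km), duration (s) per driver.
profiles = np.array([
    [10.0, 5.0, 600.0],
    [11.0, 5.2, 620.0],
    [30.0, 40.0, 3000.0],
    [20.0, 20.0, 1500.0],
])
ranking = rank_drivers_by_distance(profiles, target_index=0)
most_similar, most_different = ranking[0], ranking[-1]
```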
I tried to reinforce the model predictions by training more than once, but now using the predictions from the previous model instead of the original dependent variable, as follows:
In the first pass I followed the normal procedure, using the driver’s trips as positive and trips randomly selected from other drivers as negative, to obtain predictions.
In the next passes I used the classes predicted by the previous model as the dependent variable. This improved the cross-validation score greatly, but produced only a very small improvement on Kaggle’s validation data.
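The passes described above can be sketched with scikit-learn as follows. The number of passes, the 0.5 threshold, and predicting on the same data the model was trained on are all assumptions made for this illustration, as is the name `iterative_relabeling`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_relabeling(X, y_initial, n_passes=3, threshold=0.5):
    """Retrain repeatedly, replacing the labels with the model's own classes."""
    y = np.asarray(y_initial).copy()
    model, proba = None, None
    for _ in range(n_passes):
        model = LogisticRegression(max_iter=1000).fit(X, y)
        proba = model.predict_proba(X)[:, 1]
        new_y = (proba >= threshold).astype(int)
        if new_y.min() == new_y.max():
            break  # only one class left; another fit would fail
        y = new_y
    return model, proba, y
```

A large gain in cross-validation with little gain on held-out data is what one might expect from this scheme, since after the first pass the model is scored against labels it produced itself.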
Models and Methods
At first I decided to group the trips according to similar characteristics. This idea was based on the assumption that, since most trips belonged to the driver, it would be possible to separate the trips into two clusters, and the one with fewer trips would be the “negative.”
Another idea was to increase the number of clusters, since there were differences between the trips, even when they belonged to the same driver.
Either way, these ideas proved useless, with AUC close to 0.5.
Another unsupervised approach I tried was to calculate the cosine similarity between the average of the attributes of the trips and the individual records. In this case the “probability” of a trip belonging to the driver would be one minus this value. This also proved useless.
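For completeness, the similarity between each trip and the driver's average trip can be computed like this. The sketch returns only the raw cosine similarity, from which the score described above is derived; the name `cosine_similarity_to_mean` is illustrative.

```python
import numpy as np

def cosine_similarity_to_mean(trip_features):
    """Cosine similarity between each row and the mean of all rows."""
    centroid = trip_features.mean(axis=0)
    dots = trip_features @ centroid
    norms = np.linalg.norm(trip_features, axis=1) * np.linalg.norm(centroid)
    # Guard against all-zero rows or an all-zero centroid.
    return dots / np.where(norms == 0, 1.0, norms)

trips = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])  # toy feature rows
similarity = cosine_similarity_to_mean(trips)
```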
My most successful model was logistic regression. This surprised me, since the success described by several participants in the forum involved boosted trees.
I used about 1000 examples, 200 positive and 800 negative. An important detail was to enable the option “class_weight = auto”. This option assigns a different penalty to the examples, according to the class distribution.
In this case, an error or hit in the positive class was worth more than the same in the negative class. In this way, although we had more negative examples, the penalty for errors was balanced.
I trained a regression for each driver, tuning the type of regularization penalty (L1 or L2) and its coefficient. Each driver model ran a search over the values of this coefficient with both types of regularization, and at the end selected the type of regularization and the coefficient that gave the best AUC in cross-validation.
The best individual model reached an AUC of 0.86.
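A sketch of the per-driver search using scikit-learn. The grid of coefficient values, the number of folds, and the `liblinear` solver are assumptions chosen for the example. Note also that recent scikit-learn releases call the class weighting option `"balanced"`; the `"auto"` value mentioned above is no longer accepted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def fit_driver_model(X, y, c_values=(0.01, 0.1, 1.0, 10.0), cv=5):
    """Pick the penalty type and strength with the best cross-validated AUC."""
    base = LogisticRegression(
        solver="liblinear",        # supports both L1 and L2 penalties
        class_weight="balanced",   # reweights classes by inverse frequency
        max_iter=1000,
    )
    grid = {"penalty": ["l1", "l2"], "C": list(c_values)}
    search = GridSearchCV(base, grid, scoring="roc_auc", cv=cv)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_, search.best_score_
```

Calling this once per driver, with that driver's 200 trips as positives and the sampled trips as negatives, gives each driver its own penalty type and coefficient.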
Random Forests / Boosted Trees
According to the forums, the most used model was Gradient Boosted Trees. Basically it consists of training a sequence of decision trees, where each new tree is fitted to correct the errors left by the previous trees. The prediction for a specific example is a weighted sum of the predictions of each tree, which is then converted into a probability.
In my case the attempts to use this model were not successful. Two reasons may have contributed:
– For lack of confidence in the validation, I ended up not paying much attention to optimizing its parameters to achieve a good AUC.
– The cross-validation process to find good parameters was going to take a long time, and preliminary tests had not yielded good results.
Anyway, I trained a Random Forest model, with the default parameters, to use in an ensemble. It achieved an AUC of 0.847.
Towards the end of the competition I decided to take the “easy” route to improve the score: ensembles.
For a part of the ensemble I trained the following models:
– Logistic regression with L1 regularization only
– Logistic regression with L2 regularization only
– SVM RBF
– Linear SVM
– Random Forests
Although the SVMs did not output probabilities, I thought they would be a good contribution to the ensemble, which proved to be true.
The simple mean of the predictions of these models produced an AUC of 0.8825.
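A sketch of averaging models whose outputs are on different scales, such as probabilities and SVM margins. Rescaling each model's scores to the range 0 to 1 before averaging is an assumption on my part, since the text does not say how the SVM outputs were combined with the probabilities. The AUC helper uses the pairwise-ranking definition so the example needs only NumPy. All numbers are toy values.

```python
import numpy as np

def min_max_scale(scores):
    """Rescale scores to [0, 1] so different models are comparable."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def average_ensemble(score_lists):
    """Simple mean of the rescaled scores of several models."""
    return np.mean([min_max_scale(s) for s in score_lists], axis=0)

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count as half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = [1, 1, 0, 0]
logreg_proba = [0.9, 0.4, 0.5, 0.1]    # probabilities from one model
svm_margin = [2.0, 1.0, -1.0, -3.0]    # signed distances from an SVM
blended = average_ensemble([logreg_proba, svm_margin])
auc_logreg = auc(y, logreg_proba)
auc_blend = auc(y, blended)
```

Since AUC depends only on the ordering of the scores, a model without calibrated probabilities can still contribute as long as its scores are brought to a comparable scale.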
Finally, I combined these with everything else I had produced since the start of the competition, for a final AUC of 0.8915. This secured me a final position in the top 15% of the competition's solutions.
Methods of Winning Solutions
The winning solutions used 400 examples for each driver, selecting random trips from other drivers as negative examples. The most used model was Gradient Boosted Trees.
In addition, complex algorithms that found similar trips and assigned them directly to the driver squeezed out the extra points needed to rank among the top 10 solutions.