Defining a customer’s intent when visiting a store, whether real or virtual, can help a company deliver a personalized experience. That’s why it’s important to use tools that can help categorize and identify these visits.
Walmart provided anonymized data on customer trips to some of its stores. The data included the items purchased and their respective quantities, as well as descriptive information about the products and the day of the week. The task was to use this information to categorize the intent of each customer's visit.
Although the organizers did not provide a reference table describing each trip-type code, they said that some examples of visit categories are: a weekly grocery purchase, gifts for a celebration, or buying clothes for the new season.
There were about 95,000 trips for training and 95,000 for testing. The file basically contained one line for each product purchased during a visit, and it was up to each competitor to transform it so that the data for each trip was merged.
Three pieces of information described the products: the UPC code, with more than 100,000 distinct values; the Fineline code, a fine-grained category created by Walmart, with about 5,000 values; and the Department, with about 60 distinct values.
In addition, the ScanCount variable recorded how many units of that product were purchased; a negative number meant the customer was returning products.
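As a minimal sketch, the raw file might look like the toy DataFrame below. The column names are assumptions based on the description above, not necessarily the competition's exact schema; a negative ScanCount flags a return:

```python
import pandas as pd

# Hypothetical raw rows: one line per product scanned during a visit.
trips = pd.DataFrame({
    "VisitNumber": [1, 1, 1, 2, 2],
    "Weekday": ["Monday"] * 3 + ["Friday"] * 2,
    "Upc": [101, 102, 103, 101, 104],
    "DepartmentDescription": ["PRODUCE", "PRODUCE", "DAIRY", "PRODUCE", "TOYS"],
    "FinelineNumber": [10, 11, 20, 10, 30],
    "ScanCount": [2, 1, 3, -1, 2],  # negative ScanCount marks a return
})

# Flag visits that contain at least one returned product.
has_return = trips.groupby("VisitNumber")["ScanCount"].apply(lambda s: (s < 0).any())
print(has_return.to_dict())  # {1: False, 2: True}
```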
Transforming the Variables
To train a model it was necessary, at a minimum, to group the basic information by trip. Several forum reports described transformations that resulted in thousands of variables.
Since I only had 7 days to work on this competition, my goal was to create a simple and compact model, with as few variables as possible, but with performance good enough to place among the top 10%.
Examples of variables based on per-trip statistics include: the average quantity of each product purchased during the trip, a flag indicating whether any product was returned, the total quantity of all products purchased, and the department with the highest number of items purchased.
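A minimal sketch of that per-trip aggregation, using assumed column names on toy data:

```python
import pandas as pd

# Toy version of the raw file (column names are assumptions).
trips = pd.DataFrame({
    "VisitNumber": [1, 1, 1, 2, 2],
    "DepartmentDescription": ["PRODUCE", "PRODUCE", "DAIRY", "PRODUCE", "TOYS"],
    "ScanCount": [2, 2, 3, -1, 2],
})

# One row per trip: mean quantity, total quantity, and a return flag.
features = trips.groupby("VisitNumber").agg(
    mean_quantity=("ScanCount", "mean"),
    total_quantity=("ScanCount", "sum"),
    has_return=("ScanCount", lambda s: (s < 0).any()),
)

# Department with the highest purchased quantity in each trip.
features["top_department"] = (
    trips.groupby(["VisitNumber", "DepartmentDescription"])["ScanCount"].sum()
    .groupby(level=0).idxmax()
    .map(lambda pair: pair[1])  # keep only the department from the index pair
)
print(features)
```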
Counts and Proportions
I decided to sum the quantity of products purchased per department, and to use both the count for each department and the proportion of the purchase that the department accounted for. This created about 120 variables.
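A sketch of those count and proportion features, again with assumed column names; with the real data's roughly 60 departments, the two blocks together would give the ~120 variables mentioned above:

```python
import pandas as pd

# Toy raw rows (column names are assumptions).
trips = pd.DataFrame({
    "VisitNumber": [1, 1, 2, 2, 2],
    "DepartmentDescription": ["PRODUCE", "DAIRY", "PRODUCE", "TOYS", "TOYS"],
    "ScanCount": [2, 3, 1, 2, 2],
})

# Trips x departments matrix of purchased quantities.
counts = trips.pivot_table(index="VisitNumber",
                           columns="DepartmentDescription",
                           values="ScanCount",
                           aggfunc="sum",
                           fill_value=0)

# Share of each department within the trip.
proportions = counts.div(counts.sum(axis=1), axis=0)
print(proportions.round(2))
```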
Singular Value Decomposition + TF-IDF
To use Fineline and UPC without inflating the number of variables, I decided to sum the quantity of products within each code and apply two transformations.
First, TF-IDF, which replaces the raw quantity with a weight based on the proportion of the item within that trip and on how often the item appears in other trips.
Then I applied SVD, which looks for the components that capture the most variation in the data.
These transformations are usually applied to word counts in text documents; in that context the combination is known as Latent Semantic Analysis.
Besides reducing the dimensionality of the data, this is expected to remove much of the noise and to find latent categories to which a trip belongs.
In practice this helped a lot with Fineline, but it did not help much with UPC.
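One possible implementation of the TF-IDF + SVD step, sketched with scikit-learn on a random toy matrix standing in for the trips x Fineline-codes quantity matrix (sizes and component count are illustrative):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline

# Toy stand-in for the trips x Fineline-codes matrix of purchased quantities.
rng = np.random.RandomState(0)
fineline_counts = rng.poisson(1.0, size=(8, 20))

# TF-IDF reweights the counts, TruncatedSVD keeps the top components.
lsa = make_pipeline(TfidfTransformer(),
                    TruncatedSVD(n_components=5, random_state=0))
components = lsa.fit_transform(fineline_counts)
print(components.shape)  # (8, 5)
```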
Logistic Regression L1 and L2
Another way to reduce dimensionality is to train a simpler model and use its predictions as variables in the main model.
For this, I trained two logistic regressions: one with an L2 penalty and another with an L1 penalty. This generated about 70 variables, roughly 37 for each regression, containing the probability of an example belonging to each class.
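A sketch of this stacking idea with scikit-learn on synthetic data (3 classes instead of the competition's several dozen): out-of-fold probabilities from the two regressions become extra input columns for the main model, which avoids leaking the training labels into the features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic multiclass data standing in for the trip features.
X, y = make_classification(n_samples=200, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

l2 = LogisticRegression(penalty="l2", max_iter=1000)
l1 = LogisticRegression(penalty="l1", solver="liblinear")

# Out-of-fold class probabilities from each regression.
p_l2 = cross_val_predict(l2, X, y, cv=3, method="predict_proba")
p_l1 = cross_val_predict(l1, X, y, cv=3, method="predict_proba")

# One probability column per class per regression.
meta_features = np.hstack([p_l2, p_l1])
print(meta_features.shape)  # (200, 6)
```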
Gradient Boosted Trees – XGBoost
Most of my time was focused on building a good model with XGBoost. On its own, this model was already good enough to stay in the top 10%.
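A sketch of this modeling step on synthetic data; scikit-learn's GradientBoostingClassifier stands in for XGBoost here, and the hyperparameters are illustrative only, not the competition values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic multiclass data standing in for the trip features.
X, y = make_classification(n_samples=300, n_features=12, n_classes=3,
                           n_informative=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Gradient boosted trees with illustrative hyperparameters.
gbt = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                 max_depth=3, random_state=0)
gbt.fit(X_tr, y_tr)

# Multiclass log loss, the competition's typical evaluation metric family.
val_loss = log_loss(y_val, gbt.predict_proba(X_val))
print(round(val_loss, 3))
```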
To complement XGBoost, and to try for a better position, I decided to train a neural network on the same variables. It had two hidden layers and dropout.
Other competitors reported good results with neural networks, but I did not spend much time tuning mine. The intention was only to achieve a slight improvement over the XGBoost result.
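A sketch of a two-hidden-layer network on the same kind of features. scikit-learn's MLPClassifier has no dropout, so L2 regularization (`alpha`) stands in for it here; the layer sizes are illustrative, not the competition values:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic multiclass data standing in for the trip features.
X, y = make_classification(n_samples=300, n_features=12, n_classes=3,
                           n_informative=6, random_state=0)

# Two hidden layers; alpha adds L2 regularisation in place of dropout.
net = MLPClassifier(hidden_layer_sizes=(64, 32), alpha=1e-3,
                    max_iter=500, random_state=0)
net.fit(X, y)
proba = net.predict_proba(X)
print(proba.shape)  # (300, 3)
```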
Possible Results and Improvements
The final solution was a simple ensemble of the neural network and XGBoost, which was enough to secure a position among the top 7%. The whole solution took about 8 hours of work.
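The blend itself can be sketched as a weighted average of the two models' class probabilities; the 50/50 weights and the toy probabilities below are illustrative, not the competition values:

```python
import numpy as np

# Hypothetical class probabilities from the two models (rows = trips).
p_xgb = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])
p_net = np.array([[0.5, 0.3, 0.2],
                  [0.2, 0.5, 0.3]])

# Simple averaging ensemble; rows remain valid probability distributions.
p_ensemble = 0.5 * p_xgb + 0.5 * p_net
print(p_ensemble)
```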
Using only the Gradient Boosted Trees model, without the ensemble, it was possible to stay in the top 9%. That model used only 200 variables, whereas most of the better models used more than 5,000. Increasing the number of variables could certainly have pushed this model above the top 5%, but it would have taken much longer to train.
Other possible improvements would be: enlarging the ensemble, creating a set of variables optimized for the neural network, using the Fineline count variables directly, and better tuning the parameters of the logistic regressions and the transformations.