One of the most important tasks for an e-commerce site is ensuring that users find the products they are looking for, which requires a search engine that returns relevant results. In this competition, CrowdFlower, a platform specialized in data collection, provided product search data along with the median relevance score given by three human raters for each product–search-term pair.
The goal was to create a model that could calculate the relevance of a product to a given search phrase.
Approximately 10,000 training records were available, and another 20,000 were set aside for testing. Each record contained the search term, the product's title and description, the median relevance, and the variance of the relevance scores assigned by the raters.
The test set also contained a portion of “fake” records that were not used to compute performance; they were included to discourage hand-labeling of the test data.
The evaluation metric was Quadratic Weighted Kappa, which measures the agreement between two “raters”: in this case, the scores given by the model and the scores given by humans. It is sensitive not only to accuracy but also to the distribution of the scores (labels). A random solution yields a Kappa of 0 and perfect agreement a Kappa of 1, though there are cases where Kappa becomes negative.
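For reference, scikit-learn implements this metric. A minimal sketch with invented labels (not competition data):

```python
from sklearn.metrics import cohen_kappa_score

# Quadratic weights penalize disagreements by the squared distance
# between the two labels, so being off by 2 hurts more than being off by 1.
human = [1, 2, 3, 4, 4, 2]  # hypothetical median human ratings
model = [1, 2, 4, 4, 3, 2]  # hypothetical model predictions
kappa = cohen_kappa_score(human, model, weights="quadratic")
print(kappa)  # close to 1: the raters agree except for two off-by-one cases
```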
The basic transformation of title and description is to make a bag of words. In its simplest form it is a matrix whose rows are the examples (documents) and the columns are words (terms). The value of each cell is the number of times the word appears in that document.
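A minimal sketch of this using scikit-learn's CountVectorizer (the example documents are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["red wireless mouse",
        "wireless keyboard and mouse",
        "red dress"]

# Sparse matrix: rows = documents, columns = terms,
# cell value = number of times the term occurs in the document.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(X.shape)  # (3, 6): 3 documents, 6 distinct terms
```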
A more advanced transformation takes this matrix and reweights it using the frequency of each term within a document and across the whole corpus. This is called TF-IDF: terms that are frequent within a specific document receive a greater weight, discounted by a factor inversely proportional to their frequency across the other documents.
Put simply: rare words that appear often in a document get more weight, while words that appear in nearly every document become less important.
Beyond that, it is important to set a minimum number of documents in which a word must appear before it becomes a column, in order to reduce noise.
The best representation I found was to fit one matrix on the titles alone and another on the descriptions alone. Some participants suggested fitting on both at the same time, but since this is a search problem, I believe it is important to keep them separate.
SVD – Latent Semantic Analysis
As you might expect, the document matrix is huge and quite sparse, which makes it hard to use with some non-linear models. One option is to reduce its dimensionality with SVD (Singular Value Decomposition). This linear algebra algorithm finds the principal components of the document matrix, the directions of greatest variation. It reduces the dimensionality of the problem but can cause loss of information.
In Natural Language Processing these components tend to work well, because they filter out some of the noise in the data. Applying SVD to the TF-IDF matrix yields a representation known as Latent Semantic Analysis: each component obtained can be interpreted as a “concept” expressed by the documents. It is also widely used to visualize documents in a low-dimensional space.
In the competition I used SVD to reduce the dimensionality of the TF-IDF matrix before feeding it to the models.
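A sketch of that pipeline, TF-IDF followed by truncated SVD (the toy documents and the choice of two components are arbitrary):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = ["red wireless mouse",
        "wireless keyboard and mouse",
        "red summer dress",
        "blue summer dress"]

# TruncatedSVD works directly on the sparse TF-IDF matrix; the result is
# a dense matrix with one column per latent "concept".
lsa = make_pipeline(TfidfVectorizer(),
                    TruncatedSVD(n_components=2, random_state=0))
X_reduced = lsa.fit_transform(docs)
print(X_reduced.shape)  # (4, 2)
```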
In addition to the features above, I created others that seemed to make sense, as well as some I found in related academic papers. The most important of these was the percentage of search words present in the title. I also created attributes based on the average similarity between a product and the other products of the same search term, as well as a count of how many times the search term appears in the data.
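The most important feature can be sketched as a small function. This is a simplified version (plain whitespace tokenization, no stemming), which the actual solution may have handled differently:

```python
def query_title_overlap(query: str, title: str) -> float:
    """Fraction of search-term words that also appear in the product title."""
    query_words = set(query.lower().split())
    title_words = set(title.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & title_words) / len(query_words)

print(query_title_overlap("red mouse", "Red Wireless Mouse"))   # 1.0
print(query_title_overlap("blue mouse", "Red Wireless Mouse"))  # 0.5
```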
Another important step was one-hot encoding each search term: for each term I created a column containing 1 if that term was present and 0 otherwise. This helps the model capture the predictive value specific to each search term.
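A minimal sketch of that encoding with pandas (get_dummies is one convenient way to do it; the queries below are invented):

```python
import pandas as pd

queries = pd.Series(["laptop bag", "laptop bag", "red dress"])

# One binary column per distinct search term; each row has exactly one 1.
onehot = pd.get_dummies(queries)
print(onehot.shape)  # (3, 2): 3 rows, 2 distinct search terms
```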
Regression or Classification?
There were two ways to frame this task: treat each level of the median as a class, making it a classification problem, or treat it as a regression problem and round the predictions.
At first I thought regression was more appropriate, but since the submission format required rounded predictions, I ended up building several classification models and did not have time to test the other alternative.
With classification there are two options: use the class predicted by the model, or take a weighted average of the class probabilities to produce a score. The first alternative proved better in this case.
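The two options can be sketched with NumPy (illustrative probability rows, relevance labels 1 to 4):

```python
import numpy as np

# Rows = examples, columns = P(relevance == 1), ..., P(relevance == 4).
probs = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.7, 0.1, 0.1, 0.1]])
labels = np.array([1, 2, 3, 4])

hard = labels[probs.argmax(axis=1)]  # option 1: the most likely class
soft = probs @ labels                # option 2: probability-weighted average
print(hard)  # [4 1]
print(soft)  # [3.  1.6]
```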
Support Vector Machine
The best models were obtained using SVM. It is one of the most used models for text classification, so it is no surprise that it performs well.
The best individual model, using the SVD components and the other attributes, reached a Kappa of 0.672 in the cross-validation, and 0.654 in the LB.
Random Forest
One of my favorite models, though not usually the best choice for sparse, high-dimensional data. Even so, after several tests I reached 0.6695 in cross-validation and 0.6508 on the LB using a TF-IDF matrix with the other attributes.
One reason it may behave better with the TF-IDF matrix than with the SVD matrix is that Random Forest is a variance-reducing method, so it remains robust even with many attributes, as long as they are relevant to the prediction.
Gradient Boosted Decision Trees
Although this model is also not commonly used with sparse, high-dimensional data, I decided to train several “weak” models and stack them, using their predictions together with the other attributes, which did not involve a representation of the words.
The stacked models had an average of 0.60 Kappa in validation.
The score for this model was around 0.624 on LB. But the real intention with it was to have a different model to ensemble.
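As a rough sketch of the stacking idea (synthetic data; the exact setup in the competition differed): out-of-fold probabilities from a few weak boosted models are appended to the original attributes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for the real features and the 4 relevance levels.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=4, random_state=0)

# Out-of-fold class probabilities from several "weak" boosted models
# become extra features for a second-level model.
weak = [GradientBoostingClassifier(n_estimators=20, max_depth=2, random_state=s)
        for s in range(3)]
oof = [cross_val_predict(m, X, y, cv=3, method="predict_proba") for m in weak]
X_stacked = np.hstack(oof + [X])  # 3 models x 4 classes + 20 original columns
print(X_stacked.shape)  # (200, 32)
```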
In all of the Top 10 solutions I read, there was one detail in common: post-processing predictions. Teams used special methods for rounding, multiplying by coefficients, or determining different cut-off points for predicting each “class”.
In one case, the team sorted the predictions and rounded them according to the proportion of each class in the training data. Since Kappa is sensitive to the distribution of the predictions, this ended up helping.
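That trick can be sketched as follows (my reconstruction of the idea, not the team's actual code): sort the continuous predictions and cut them at the quantiles given by the training class proportions.

```python
import numpy as np

def round_to_distribution(preds, train_labels):
    """Round continuous predictions so the predicted class proportions
    match the class proportions seen in the training labels."""
    preds = np.asarray(preds, dtype=float)
    labels, counts = np.unique(train_labels, return_counts=True)
    # Cumulative class proportions define the cut points between classes.
    cuts = np.cumsum(counts / counts.sum())[:-1]
    thresholds = np.quantile(preds, cuts)
    return labels[np.searchsorted(thresholds, preds, side="right")]

print(round_to_distribution([1.4, 3.1, 2.2, 3.8], [1, 2, 3, 4]))  # [1 3 2 4]
```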
The closest I got to adjusting the predictions was using the geometric mean instead of the simple average for the ensemble. The models tended to overestimate the relevance of the products, and this helped bias the predictions toward the lower classes.
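A quick sketch of why this biases scores downward: by the AM-GM inequality, the geometric mean of positive numbers never exceeds their arithmetic mean (the scores below are invented):

```python
import numpy as np

preds = np.array([[3.2, 2.0, 3.9],   # model A scores (invented)
                  [3.6, 2.4, 3.5]])  # model B scores (invented)

arith = preds.mean(axis=0)
geo = np.exp(np.log(preds).mean(axis=0))  # geometric mean across models

# The geometric mean is <= the arithmetic mean elementwise,
# nudging the ensembled scores toward the lower classes.
print(np.all(geo <= arith))  # True
```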
Ensemble and Overfitting
In the end, I took a geometric mean of my best models. The ensemble consisted of several SVMs, a Random Forest, and a Gradient Boosted Decision Trees model.
Although it is not good practice, especially with as little data as this competition had, I used the LB to validate the ensemble. This led me to choose my second-best ensemble, yet I still finished in the Top 5% of the competition.