Using Machine Learning to Discover Illegal Ads in Russian, While Not Knowing Russian

The site Avito.Ru, founded in 2007, is one of the 5 largest Russian classifieds websites. It is used by both individuals and businesses to buy and sell new and used products. One of its biggest challenges is maintaining the quality of the content in the face of its growth and the volume of new ads. To address this, the site decided to look for a solution based on the available data.

This was a competition run on the Kaggle platform.

For readers with technical knowledge, the final solution code is available at this link (github).

Available Data

The available data contained the content of the ads: attributes of the advertised product, category and sub-category, title, and description. The variable to be predicted was binary: whether or not the ad would be blocked by moderation. About 4 million ads labeled by human moderators were made available for training, and 1.5 million for testing.

In addition, there was a variable indicating whether the ad had been blocked by an experienced moderator, which could help account for human error.

Evaluation Metric

The metric used to evaluate the solutions was average precision at K (AP@K). After assigning each ad a score indicating how likely it was to be breaking the rules, the ads had to be ordered so that the most likely violations appeared at the top of the ranking. The system then took the top K ads and compared them with the actual labels to determine the model's hit rate.
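To make the ranking objective concrete, here is a minimal sketch of how a metric of this kind can be computed. The exact normalization used by the competition scorer may differ, so treat this as an illustration rather than the official implementation:

```python
import numpy as np

def average_precision_at_k(y_true, scores, k):
    """Average precision over the top-k ads ranked by model score.

    y_true: 1 if the ad was actually blocked by moderation, else 0.
    scores: model scores, higher = more likely to be blocked.
    """
    k = min(k, len(scores))
    top = np.argsort(scores)[::-1][:k]             # indices of the k highest-scored ads
    hits = np.asarray(y_true)[top]                 # 1 where a top-ranked ad was truly blocked
    if hits.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(hits) / (np.arange(k) + 1)
    return float((precision_at_i * hits).sum() / hits.sum())

# Toy usage: two of the three blocked ads appear in the top 3.
print(average_precision_at_k([1, 0, 1, 0, 1], [0.9, 0.8, 0.7, 0.2, 0.1], k=3))
```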

Data Transformation

My solution used only the title and description of the ads. First the basics: put all words in lowercase, remove stopwords, and stem. One detail is that the ads were in Russian, so I do not know exactly which stopwords have other meanings (in English, “can” is a stopword, but it is also used as a noun). Either way, these procedures improved the model’s performance. I then turned the documents into a numeric matrix in which each row was an ad and each column a word. For the element values I tested three variations:

– Binary: where the presence of the word in the ad was indicated with the number 1;
– Count: each value was the number of times the word appeared in the ad;
– Tf-Idf: each value was based on a formula that takes into account the frequency of the word in the ad, and also its frequency across the other ads. It assigns a higher value to words that are rare in the overall corpus but frequent within the specific ad.

Among these alternatives, the one that demonstrated the best performance was the Tf-Idf. This is a technique widely used for text classification, and generally shows improvement in classification accuracy over the other options.
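As an illustration, here is a minimal sketch of this pipeline using NLTK and scikit-learn. The `ads` variable and the absence of any tuning are assumptions for the example, not the exact code of my solution:

```python
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Russian resources from NLTK (requires a prior nltk.download("stopwords")).
russian_stopwords = set(stopwords.words("russian"))
stemmer = SnowballStemmer("russian")

def preprocess(text):
    """Lowercase the text, remove stopwords and stem each remaining token."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in russian_stopwords)

# `ads` is a hypothetical list of (title, description) pairs.
docs = [preprocess(title + " " + description) for title, description in ads]

# The three representations tested:
binary_matrix = CountVectorizer(binary=True).fit_transform(docs)  # word present or not
count_matrix = CountVectorizer().fit_transform(docs)              # raw word counts
tfidf_matrix = TfidfVectorizer().fit_transform(docs)              # tf-idf weights (best here)
```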

I made some attempts to clean the data, such as removing the rarest words, but they hurt performance. What did help a little was removing numbers.
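For anyone reproducing these two experiments, these are the scikit-learn knobs I would reach for; the `min_df` threshold here is illustrative, not the value actually tested:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Pruning rare words (this hurt the score in my tests).
pruned = TfidfVectorizer(min_df=5)  # ignore words that appear in fewer than 5 ads

# Keeping only letter tokens, which drops the numbers (this helped a little).
no_numbers = TfidfVectorizer(token_pattern=r"(?u)\b[^\W\d_]{2,}\b")
```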

Solution

My focus was on creating a simple solution, so my first attempt was Logistic Regression on all the data. One factor to take into account was the distribution of positive and negative examples, which was not 50% for each class. With this solution the score was 91.8%.

After testing a few of the options available in scikit-learn (the Python machine learning library I used), I found that the “modified_huber” loss function produced a more accurate model. This loss is more robust than log loss, since it applies a quadratic penalty to small errors and a linear penalty to large ones.
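In scikit-learn this loss is available through SGDClassifier. The sketch below shows how it fits into the pipeline; the hyperparameters are placeholders rather than those of the final model, and `y_train` is the hypothetical blocked/approved label vector:

```python
from sklearn.linear_model import SGDClassifier

# Linear classifier trained with stochastic gradient descent on the sparse tf-idf matrix.
# Unlike the hinge loss, "modified_huber" also provides predict_proba, which is convenient
# for producing the ranking scores that the metric needs.
model = SGDClassifier(loss="modified_huber", penalty="l2", alpha=1e-5)
model.fit(tfidf_matrix, y_train)                  # y_train: 1 = blocked, 0 = approved
scores = model.predict_proba(tfidf_matrix)[:, 1]  # in practice, score the test matrix instead
```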

Another idea that helped a lot was to separate the ads by category. Each category had a different proportion of positive and negative examples (some with less than 1%, others with more than 30%). Applying the above algorithm to the data transformed in this way reached a score of 97.3%, an improvement of 10.6%.
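A sketch of how that per-category split might look in code. The `train_rows` and `test_rows` mappings from category to row indices, as well as `X_train`, `X_test` and `y_train`, are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

test_scores = np.zeros(X_test.shape[0])

for category, rows in train_rows.items():
    # One model per category, so each learns its own proportion of blocked ads.
    clf = SGDClassifier(loss="modified_huber")
    clf.fit(X_train[rows], y_train[rows])
    t = test_rows[category]
    test_scores[t] = clf.predict_proba(X_test[t])[:, 1]
```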

For the final solution, I also trained a Naive Bayes model which, despite assuming that the variables are independent, performs well for text classification. By averaging the predictions of the two algorithms, I reached the final score of 97.9%.
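The blend itself can be as simple as averaging the two models' probabilities; a minimal sketch under the same assumed variables as above:

```python
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# Naive Bayes handles the sparse tf-idf matrix directly, despite its independence assumption.
nb = MultinomialNB().fit(X_train, y_train)
sgd = SGDClassifier(loss="modified_huber").fit(X_train, y_train)

# The final ranking score is a simple average of the two predicted probabilities.
final_scores = (nb.predict_proba(X_test)[:, 1] + sgd.predict_proba(X_test)[:, 1]) / 2
```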

Differences to the winning solution

Comparing my solution with that of the winning team, there is a small difference of 0.8% in the score. But looking at complexity, the winning solution used more than 30 models, between transformations and classification, for each example of the training set. In practice, as is usually the case in these competitions, it would not be worth implementing the winning solution. That does not take credit away from the winners, nor from the opportunity to learn from them.
