The Secret Behind One of the Biggest Online Marketplaces, “OLX Group”: How Do They Use Their Data and Machine Learning Algorithms?
OLX Group is a global online marketplace operating in 45 countries and is the largest online classified ads company in India, Brazil, Pakistan, Bulgaria, Poland, Portugal, and Ukraine. It was founded by Alec Oxenford and Fabrice Grinda in 2006.
A platform that connects buyers and sellers in more than 40 countries and serves hundreds of millions of customers per month faces many challenges that are to some extent similar to those of online retail, but also somewhat different.
There are three main challenges:
Challenge 1: User experience. This covers how the user navigates the platform, and which recommendations and search results they are shown.
Challenge 2: Identifying what it is that makes some advertisements much more liquid (easier to sell) than others.
Challenge 3: Liquidity prediction. Predicting whether an item is sold 15 days after its entry into the system.
As part of the solution for a good navigation and browsing experience, it is useful to have a good estimate of whether a specific advertisement has already been sold, so that we don’t show it again in the recommendations or search output. This is a probabilistic time-series prediction problem. Another important aspect, connected to the previous one, is identifying what it is that makes some advertisements much more liquid (easier to sell) than others. Here, understanding how the model makes its decisions is really important, as the outcome can be passed on to sellers to help them improve the liquidity of their advertisements. For the remainder of this article we will focus on this specific liquidity prediction problem, predicting whether an item is sold 15 days after its entry into the system, and we will use XGBoost and eli5 for modelling and for explaining the predictions, respectively.
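To make the prediction target concrete, here is a minimal sketch of how a binary liquidity label could be derived from two timestamps. It assumes that "sold 15 days after its entry" means "sold no later than 15 days after posting"; the function name, the field layout and the sample ads are all invented for illustration and are not taken from OLX's pipeline.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: an ad counts as "liquid" if it sold no later than 15 days after posting.
LIQUIDITY_WINDOW = timedelta(days=15)


def liquidity_label(posted_at: datetime, sold_at: Optional[datetime]) -> int:
    """Return 1 if the item sold inside the window after posting, else 0."""
    if sold_at is None:  # never sold, or not sold yet
        return 0
    return int(timedelta(0) <= sold_at - posted_at <= LIQUIDITY_WINDOW)


# Hypothetical ads as (posted_at, sold_at) pairs.
ads = [
    (datetime(2018, 3, 1, 12, 0), datetime(2018, 3, 3, 9, 30)),  # sold after about 2 days
    (datetime(2018, 3, 1, 18, 0), datetime(2018, 4, 2, 10, 0)),  # sold after about a month
    (datetime(2018, 3, 4, 13, 0), None),                         # not sold
]

labels = [liquidity_label(posted, sold) for posted, sold in ads]
print(labels)
```

With labels of this kind, the task becomes an ordinary binary classification problem in which the predicted probability is read as the estimated chance that the ad is liquid.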
XGBoost is a well-known library for “boosting”, the process of iteratively adding models to an ensemble, each one targeting the remaining error (the pseudo-residuals). These “weak learners” are simple models that on their own are only good at dealing with specific parts of the problem space, but the iterative fitting approach used to build this type of ensemble can significantly reduce bias while controlling variance, giving a good model in the process.
The data we have available for this problem include textual data (the title and textual description of the original advertisement, plus any chat interactions between the seller and potential buyers), as well as categorical and numeric data (the category of the advertisement, the brand and model of the item, the price, the number of buyer/seller interactions for each day after the entry, etc.). The data sample we are using here is a relatively small part of the data, from some countries and categories only, so in many of its properties it is not representative of the entire item collection. Nevertheless, let’s start with some basic data munging.
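Before looking at the data, the boosting idea described above (weak learners fitted one after another to the pseudo-residuals) can be illustrated in a few lines. This is a deliberately simplified sketch in plain Python using one-split regression stumps and squared loss. It is not how XGBoost is implemented, since XGBoost also uses second-order gradient information and regularization. The toy data and all function names are invented for illustration.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, left_value, right_value)


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def fit_stump(xs: List[float], residuals: List[float]) -> Stump:
    """Fit a one-split regression tree (a weak learner) by least squares.

    Assumes xs contains at least two distinct values.
    """
    best_stump, best_error = None, float("inf")
    points = sorted(set(xs))
    # Candidate thresholds are the midpoints between consecutive distinct x values.
    for low, high in zip(points, points[1:]):
        threshold = (low + high) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = sum((r - left_value) ** 2 for r in left)
        error += sum((r - right_value) ** 2 for r in right)
        if error < best_error:
            best_stump, best_error = (threshold, left_value, right_value), error
    return best_stump


def stump_predict(stump: Stump, x: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Return (base prediction, learning rate, list of fitted stumps)."""
    base = mean(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss the pseudo-residuals are simply y minus the current prediction.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, learning_rate, stumps


def predict(model, x: float) -> float:
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(ys, predictions) -> float:
    return mean([(y - p) ** 2 for y, p in zip(ys, predictions)])


# Toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9, 6.0, 6.1]

model = boost(xs, ys)
mse_before = mse(ys, [mean(ys)] * len(ys))
mse_after = mse(ys, [predict(model, x) for x in xs])
print(f"MSE of the constant baseline: {mse_before:.4f}")
print(f"MSE after boosting:           {mse_after:.4f}")
```

Each stump is a poor model on its own, but because every round is fitted to what the ensemble still gets wrong, the training error shrinks steadily. The learning rate scales down each stump's contribution, which is one of the usual ways of keeping variance under control.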
The histogram of the day on which an item was sold is shown above. We can easily see that most items are sold in the first days after the respective advertisement is placed, but there are still significant sales happening a month later as well. With respect to the day of the week on which an advertisement is added to the platform, we can see that there is a peak on weekends, while the other days are roughly at the same level. Finally, with respect to the hour at which an advertisement is added to the platform, we can see in the figure below that there is a peak around lunchtime and a second peak after work hours.
One way to capture more complicated relations is to use the pairplot functionality of the seaborn library. This gives us scatterplots for every pairwise combination of the selected columns, while on the main diagonal we can plot something different, such as the respective univariate distributions. We can see that the number of buyer interactions on the first day is a strong predictor of whether an item will be sold early or late. We can also see that the category id is a very important predictor as well, as some categories in general tend to be much more liquid than others.
Now that we are done with the basic data munging, we can proceed to build a model using the XGBoost library.
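The time-based views discussed above (days until sale, weekday of posting, hour of posting) can all be derived from two timestamps per ad. The following sketch shows one way to do that with the Python standard library only. The four sample ads are hypothetical and far too few to show the patterns described in the text; they only demonstrate the mechanics.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical (posted_at, sold_at) pairs; real data would come from the ad logs.
ads = [
    (datetime(2018, 3, 3, 12, 15), datetime(2018, 3, 4, 10, 0)),   # posted on a Saturday
    (datetime(2018, 3, 3, 19, 40), datetime(2018, 3, 3, 22, 5)),   # posted on a Saturday
    (datetime(2018, 3, 4, 13, 5), datetime(2018, 3, 9, 8, 30)),    # posted on a Sunday
    (datetime(2018, 3, 6, 18, 20), datetime(2018, 4, 7, 11, 0)),   # posted on a Tuesday
]

# Whole days elapsed between posting and sale.
days_to_sale = [(sold - posted).days for posted, sold in ads]
# Day of the week and hour of the day at which each ad was posted.
weekday_counts = Counter(WEEKDAYS[posted.weekday()] for posted, _ in ads)
hour_counts = Counter(posted.hour for posted, _ in ads)


def text_histogram(counts: Counter) -> str:
    """Render counts as a crude text bar chart, one row per key in sorted order."""
    return "\n".join(f"{key!s:>10} {'#' * counts[key]}" for key in sorted(counts))


print("Days to sale:")
print(text_histogram(Counter(days_to_sale)))
print("Posting weekday:")
print(text_histogram(weekday_counts))
print("Posting hour:")
print(text_histogram(hour_counts))
```

The same derived columns (days to sale, weekday, hour) are the kind of numeric features that could then be passed to a plotting library or used directly as model inputs.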
Using a hyperparameter optimization framework, we can find the hyperparameters that work best for this data. Since we are also interested in the output confidence of the prediction itself (and not only in the class), it is typically a good idea to use a value for min_child_weight that is equal to or larger than 10 (provided that we don’t lose predictive performance), as the probabilities will tend to be better calibrated.
The ranking of the features from the XGBoost model is shown in the figure above. Feature rankings from tree ensembles can be biased (favoring, for example, continuous features or categorical features with many levels over binary features or categorical features with few levels), and if features are highly correlated the effect can be split between them in a non-uniform way. Even so, the ranking is already a good indication for many purposes.
Now we select one specific instance at prediction time. Using eli5, we get an explanation of how this instance was handled internally by the model, together with the features that were the most important positive and negative influences for this specific sample.
As we can see, the sample was classified as being liquid, but there was still some pull-down from the text properties (title length, title words, etc.), which we can use to give the seller guidance on improving the advertisement.
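To give an intuition for where such positive and negative influences come from, here is a simplified sketch of the decision-path idea that, as far as its documentation describes, eli5 uses for tree ensembles: start from the score at the root of a tree and attribute every change in score along the sample's path to the feature that was split on. The single tree below, its feature names and all of its numbers are made up for illustration. A real XGBoost model sums many trees and works with log-odds rather than probabilities.

```python
from collections import defaultdict

# Hypothetical tree. Every node stores the mean score of the training samples
# that reach it; internal nodes also store the feature and threshold they split on.
tree = {
    "value": 0.40, "feature": "buyer_interactions_day1", "threshold": 3,
    "left": {"value": 0.20},
    "right": {
        "value": 0.70, "feature": "title_length", "threshold": 20,
        "left": {"value": 0.55},
        "right": {"value": 0.85},
    },
}


def explain(tree: dict, sample: dict):
    """Return (bias, per-feature contributions, final score) for one sample."""
    contributions = defaultdict(float)
    node = tree
    while "feature" in node:  # leaves have no "feature" key
        feature = node["feature"]
        branch = "left" if sample[feature] <= node["threshold"] else "right"
        child = node[branch]
        # The change in score caused by this split is credited to its feature.
        contributions[feature] += child["value"] - node["value"]
        node = child
    return tree["value"], dict(contributions), node["value"]


# Hypothetical ad: many buyer interactions on the first day, but a short title.
sample = {"buyer_interactions_day1": 5, "title_length": 12}
bias, contributions, score = explain(tree, sample)

print(f"bias: {bias:+.2f}")
for feature, weight in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{feature}: {weight:+.2f}")
print(f"score: {score:.2f}")
```

By construction, the bias plus the contributions add up to the final score. In this toy case the interaction count pushes the score up and the short title pulls it down, which is the same kind of pattern as the one described above and the kind of signal that could be turned into advice for the seller.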