How product search works in SberMarket / Habr
Hello everybody! My name is Hanna Vlasova. I work as an ML engineer on the Search team at SberMarket. In this article, I will tell you how our pipeline is organized: from the moment the user types a query until we return search results. If you develop search or are simply interested in the topic, you will probably find useful insights for your work.
Briefly about what awaits you:
Areas of responsibility of the Search team;
How we select candidates for display in search results;
Final ranking of products by ML model.
7 out of 10 products in SberMarket are added to the cart directly from Search, so even small changes in our products have a big and visible effect on business. That is why we pay a lot of attention to the continuous improvement of our solutions and have already achieved good results. I hope you will learn something new from our case study and be able to apply it in your work. Let’s go!
Before sending a request
First, the user needs to decide how to search for a product. On the main page, they can run a cross-retailer search: it shows offers from all stores, so they can compare prices and choose which store has what they need. The user can also go to a specific store and search within it. Or they can simply browse the catalog, but in this article we focus on search itself.
After the user starts typing a query, suggestions with relevant categories and offers appear; our team handles these as well.
So, the query has been entered. For example, a buyer searches for “pepper” and gets a page of results.
What happened between these two events? We will find out soon, but let’s take it step by step.
Selection of candidates
The process begins with selecting relevant products and pre-ranking them. But first, let’s look at where and in what form all the product information is stored. I will note right away that we use Elasticsearch as our search engine.
All product information lives in the index – a repository of documents (products) in the search engine. The index consists of several shards – small sub-indexes that structure the storage and help find the needed documents faster.
To build the index, we first load products from our internal database into the search system. Then we process the data: remove stop words, reduce names to their base morphological form, and expand the fields with synonyms that we collect separately.
Let’s say the product name is “Dobry cherry juice”. A synonym of the word “juice” is “nectar”. Later, if the user types “nectar” into the search box, the system will also offer the product “Dobry cherry juice”, because we expanded the search field with the synonym.
We expand not only product names with synonyms, but other fields too, for example, brand names. There it is mostly transliteration, that is, a Russian-language spelling of an English-language brand. For example, we expand the brand bombbar with the synonyms bombard and bomb bar. Synonym groups can be shared by all fields or used only in a specific field.
So, the data is ready: the product name, its normalized form, and the synonym expansions. Next, an inverted index is built and the data is distributed across shards for more efficient search. Now it’s time for the stage of receiving a query from the user.
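The indexing steps above (stop-word removal, morphological normalization, synonym expansion) can be sketched as an Elasticsearch index definition. Field names, filter names, and the synonym lists here are illustrative assumptions, not the team's real configuration; a stemmer stands in for the morphological normalizer:

```python
# Illustrative Elasticsearch index body: an analysis chain with stop words,
# stemming, and synonyms, applied to the searchable product fields.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "product_stop": {"type": "stop", "stopwords": ["a", "the", "of"]},
                "product_stem": {"type": "stemmer", "language": "english"},
                "product_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "juice, nectar",            # example from the text
                        "bombbar, bombard, bomb bar",
                    ],
                },
            },
            "analyzer": {
                "product_name_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "product_stop",
                               "product_stem", "product_synonyms"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "name":  {"type": "text", "analyzer": "product_name_analyzer"},
            "brand": {"type": "text", "analyzer": "product_name_analyzer"},
        }
    },
}
```

This body would be passed when creating the index, e.g. via the official client's `indices.create` call.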
A request is coming
Here begins the process of selecting relevant products and pre-ranking, which consists of several parts:
First, we check whether the search query contains obscene language: for this, we run the query through a special internal service. If the check passes and no profanity is detected, we move on.
Then we apply the same text processing I described above, but to the user’s query rather than the product: we remove stop words, perform morphological normalization, and expand with synonyms. Note that synonyms for search queries may differ from those used for product names. Elasticsearch distinguishes the two: index-time analysis is applied to names during indexing, query-time analysis is applied to queries.
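This index-time vs query-time split maps onto Elasticsearch's `analyzer` and `search_analyzer` field parameters. A minimal illustrative mapping (the analyzer names are assumptions):

```python
# A text field can analyze stored documents and incoming queries differently,
# which is how query synonyms can diverge from product-name synonyms.
name_field_mapping = {
    "name": {
        "type": "text",
        "analyzer": "product_index_analyzer",  # applied when documents are indexed
        "search_analyzer": "query_analyzer",   # applied to the user's query text
    }
}
```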
Next, we check whether the search query or its modifications (the normalized query or synonyms) match the product name or its category, brand, and other fields involved in the search. If there are matches, the product is considered relevant to the query.
If there are no matches, we run the query through a spell-checker: perhaps the user made a typo, and we need to correct the invalid query. Then we repeat all the previous steps.
If that did not help either, we relax the query and again check for matches between the search query and the product name.
Let me explain what relaxing the query means. For example, a user searches for Coca-Cola banana, but we have no such banana-flavored Coca-Cola product. Then we run a search for each of these words separately. The answer to the buyer’s query will be a strange mix of Coca-Cola, bananas, and banana products. But it is very important for us not to return an empty page, because an empty result is the most unpleasant situation for search. We always want to show the user at least something, even if it is not a 100% hit on the query. At the time of publication, this functionality is disabled, but we are looking for an optimal solution and running experiments.
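Query relaxation can be sketched as two Elasticsearch match queries: a strict one requiring every term, and a relaxed fallback accepting any single term. `build_queries` is a hypothetical helper, not the production code:

```python
def build_queries(user_query: str):
    """Return a strict query and a relaxed fallback (illustrative sketch).

    The strict version requires all terms to match; the relaxed version
    accepts documents matching any one term, so the result is non-empty
    as long as at least one word is known to the index."""
    strict = {"match": {"name": {"query": user_query, "operator": "and"}}}
    relaxed = {"match": {"name": {"query": user_query, "operator": "or"}}}
    return strict, relaxed
```

The caller would run the strict query first and fall back to the relaxed one only when the strict result set is empty.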
After the products relevant to the query have been selected, they need to be pre-ranked.
Text proximity and purchase rate
We use two signals for relevance assessment and preliminary ranking:
text proximity – how well the search query text matches the product name or other fields we also use for search (most often brand or product category). Sometimes domain-specific fields are needed, for example, the active-substance field for pharmacies.
purchase rate – how often the product was purchased from results for this search query.
The higher the text proximity and purchase rate, the higher the product sits in the preliminary ranking. To evaluate text proximity we use the BM25 scoring function, a more advanced version of the popular TF-IDF algorithm: BM25 takes more parameters into account and adds coefficients and smoothing to them, giving more accurate results.
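For reference, here is a toy BM25 implementation over a tiny in-memory corpus. The parameters `k1` and `b` use common defaults; this is a sketch of the scoring formula, not the engine's internals:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document (list of tokens) against a query over a corpus
    given as a list of token lists."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs  # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(term)                # term frequency in this doc
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score
```

Unlike raw TF-IDF, the `k1` term saturates the contribution of repeated terms, and `b` normalizes by document length, which is the "coefficients and smoothing" mentioned above.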
After analyzing text proximity, we see that, for example, “Prostokvashino milk” has no textual match with the search query “pepper”, so it is an irrelevant product. But the product names “Red pepper”, “Ground black pepper”, and “Bell pepper” do match the search query. We therefore consider them relevant by textual proximity, so these products will rank much higher than the others.
As a result, the purchase rate is multiplied by the BM25 score. The top 300 products from this stage move on to the next one: ranking with an ML model.
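The combination step can be sketched as multiplying the two signals and keeping the top 300 (the function name and tuple layout are illustrative):

```python
def pre_rank(candidates, top_n=300):
    """candidates: list of (product, bm25_score, purchase_rate) tuples.
    The pre-ranking score is the product of the text and behavioral signals;
    the best top_n candidates go on to the ML ranking stage."""
    scored = sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)
    return scored[:top_n]
```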
Ranking by ML model
We use a gradient boosting model, XGBRanker. We pass it the products along with their key features:
User interaction with the product.
Availability and size of the discount.
Popularity (purchase) of the product.
As you may know, a ranking model needs, in addition to features and target labels (that is, the target), one more parameter: qid. This parameter splits the input data into groups, and ranking happens within each group. To train the model, we split the data by search session: one session is one query in the store the user selected at a particular moment in time.
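Building the qid array for such a model might look like the sketch below, where each session key gets one integer group id; the resulting arrays would then go to something like `xgboost.XGBRanker().fit(X, y, qid=qid)`. The session key structure is an assumption for illustration:

```python
def build_qid(sessions):
    """sessions: list of (session_key, features, label) rows, where a
    session_key identifies one query in one store at one point in time.
    Returns parallel lists (X, y, qid); rows sharing a qid are ranked
    against each other during training."""
    key_to_id = {}
    X, y, qid = [], [], []
    for key, features, label in sessions:
        gid = key_to_id.setdefault(key, len(key_to_id))  # one id per session
        X.append(features)
        y.append(label)
        qid.append(gid)
    return X, y, qid
```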
Preparation of the target
Let’s start with the target or, in other words, the target label. We assign it a value from 0 to 4 according to the following principle:
4 – if the product was added to the cart by the user and purchased.
3 — if the product was added to the cart, but the purchase was not completed for some reason.
2 – if the product was added to the cart, but then deleted.
1 – if the user opened the product card, but did not add it to the cart.
0 – if the product was in the search result, and the user ignored it, but here it is a bit more complicated.
In addition to event data, we take products that could have appeared for the query but that the user did not actually see. We send a search request to our engine and get a list of products back. We pick the number of these unseen products so that each result set contains 100 products. We chose this approach because it improved our quality metrics: it made the model more robust to varied input data.
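The labeling rules above can be expressed as a small function (the event flags are illustrative names, not the real event schema):

```python
def target_label(added_to_cart, purchased, removed_from_cart, card_opened):
    """Map user interaction events to the 0-4 relevance target described above."""
    if added_to_cart and purchased:
        return 4  # added and the purchase was completed
    if added_to_cart and removed_from_cart:
        return 2  # added but later deleted from the cart
    if added_to_cart:
        return 3  # added, purchase not completed
    if card_opened:
        return 1  # product card opened, no add to cart
    return 0      # shown in results (or unseen) and ignored
```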
Collecting features
As features for the ranking model, we use various statistics. One is the popularity of a product for a given query, that is, the number of purchases from search results (or the ratio of purchases to impressions). Besides purchases, we compute similar statistics for product-card views and add-to-cart events.
I would single out user interactions with products as a separate group, because the features in this group contribute the most to the model’s quality.
We also account for users’ interactions with products at different levels: within a store, a retailer, a region, or across all of SberMarket.
Users often come to us for good deals, so price-based features matter: the availability and size of a discount, and a comparison of the current price with the historical price also strongly affect quality. These features are computed online, that is, in real time.
Once the target and features are prepared, all the data goes into the model. For offline evaluation, we use the NDCG@k and MAP@k ranking quality metrics.
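As an illustration, NDCG@k over one ranked session can be computed as below (a sketch using the common exponential gain 2^rel − 1; not the team's exact evaluation code):

```python
import math

def ndcg_at_k(labels, k):
    """labels: relevance labels (0-4) in the order the model ranked them.
    Returns the DCG of that order normalized by the ideal DCG."""
    def dcg(ls):
        return sum((2 ** l - 1) / math.log2(i + 2) for i, l in enumerate(ls[:k]))
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0
```

A ranking that puts the highest labels first scores 1.0; pushing relevant items down lowers the metric.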
Ranking the most important features
Now let’s move on to the most significant features and discuss the top ten.
As many as three positions in the top 10 are taken by one feature: the ratio of add-to-cart events to impressions in search results, computed within a retailer, a region, and across all of SberMarket. Say a user searched for pepper in Metro. We look at how often a particular product, for example red pepper, was added to the cart for the query “pepper”, and divide that by the number of times this product was shown in search results.
The second group of significant features are discounts: their availability and amount in rubles.
Also in the top is the “user signal”: the number of the user’s orders that include this product, and the relative addition rate: the share of all the user’s orders in which this product appeared. I should clarify that the buyer may add the product not only from search, but also from the catalog or other sections. We account for how much they like this product, regardless of where they add it from.
The popularity of the product in a particular store also makes the top 10: both the absolute number of orders and the share of orders in that store that include the product.
Search is a complex system that works with natural language, so it can sometimes produce funny and not very correct results. It is important to consider such cases and constantly update products and models.
And may you always find exactly what you are looking for!
Product&data team of SberMarket maintains social networks with news and announcements. If you want to know what is under the hood of highly loaded e-commerce, follow us on Telegram and YouTube. And also listen to the podcast “For tech and those” from our IT managers.