Optimizing ad search by daily-rental booking dates
Hello everyone! My name is Azamat, I am a backend developer at Tsian working on search services. In this article I will describe how our team optimized the search for ads by booking dates in the daily rental section: how we solved the problem of growing CPU consumption, sped up the search itself, and reduced hardware costs.
The article will be especially useful for those who want to learn more about how Elasticsearch works under the hood, who develop and maintain search services, or who need to optimize non-trivial search queries.
Problem
At the beginning of the year, we expanded the search capabilities for daily apartment rental ads: we added a filter by available dates and by the price on those dates. That is, it became possible to search for housing that is vacant for a specified period and fits a specified price range for that period. When everything was done, we discovered one big problem: such a search loaded the CPU of the Elasticsearch cluster much more heavily than a search without these filters. We had to figure out how to reduce the load.
There was also a constraint: each ad has a calendar of available dates, with prices for those dates, six months ahead.
An additional complication is the frequent updating of ads. With each reservation, or whenever new free dates appear, the calendar needs to be updated, and updated quickly, so that the user sees up-to-date data when searching.
The initial solution is a nested field
Before we started the optimization, another, suboptimal solution was already in production; we had built it as a quick experiment, since at the time we did not want to spend much time on performance research. Let's look at this solution to understand its shortcomings.
We created a nested field storing the list of all available dates with their prices. It looks like this:
# a document in elasticsearch describing one ad
{
  ...
  "available_dates": [  # field of type nested
    {"date": "01-09-2022", "price": 1500},
    {"date": "02-09-2022", "price": 1600},
    ...  # up to 120 such objects
  ]
  ...
}
Accordingly, if we want to find ads for the period from September 1 to September 3 at a price of 1,000 to 2,000 rubles per day, we look for ads for which every date in this period is available. The query looks like this:
{
  "query": {
    "bool": {
      "must": [
        {
          # Condition for September 1
          "nested": {
            "path": "available_dates",
            "query": {
              "bool": {
                "must": [
                  {"range": {"available_dates.price": {"gte": 1000, "lte": 2000}}},
                  {"term": {"available_dates.date": "01-09-2022"}}
                ]
              }
            }
          }
        },
        {
          # Condition for September 2; only the date differs
          "nested": {
            "path": "available_dates",
            "query": {
              "bool": {
                "must": [
                  {"range": {"available_dates.price": {"gte": 1000, "lte": 2000}}},
                  {"term": {"available_dates.date": "02-09-2022"}}
                ]
              }
            }
          }
        }
        # No condition for September 3, because it is the check-out date;
        # another guest can check in on that day
      ]
    }
  }
}
For such a query we cannot use any field type other than nested, because we need to preserve the link between a free date and the price for that date. If we used the simple "object" type, this link would be lost: Elasticsearch flattens arrays of objects into simple lists of values. You can read more about it here.
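To make the flattening problem concrete, here is a toy Python sketch (not Elasticsearch itself, and the matching logic is a simplification): with the "object" type, arrays of objects become parallel value lists, so the date check and the price check run independently and any date can "borrow" another date's price.

```python
# Simulated flattened representation: the pairing between a date and its
# price is gone, only two parallel lists remain.
ad_as_flattened_object = {
    "available_dates.date": ["01-09-2022", "02-09-2022"],
    "available_dates.price": [5000, 1500],  # Sept 1 costs 5000, Sept 2 costs 1500
}

def object_type_match(doc, date, min_price, max_price):
    # Each condition is checked over the whole list, not per original object.
    date_ok = date in doc["available_dates.date"]
    price_ok = any(min_price <= p <= max_price for p in doc["available_dates.price"])
    return date_ok and price_ok

# September 1 actually costs 5000, yet the document matches a 1000-2000
# filter because September 2's price of 1500 satisfies the range check.
print(object_type_match(ad_as_flattened_object, "01-09-2022", 1000, 2000))  # True: a false positive
```

The nested type avoids this by indexing each object as its own hidden document, which is exactly what causes the problems described below.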
Cons of this solution
The main problem is the growing number of deleted documents in the index. Before we added the booking dates, the index contained about 3 million documents; afterwards, it held 10-12 million (note that this is not the number of ads). Because of this, the CPU load increased significantly and the cluster stopped coping.
Let me emphasize: the documents are not only the ads themselves but also the individual booking dates, as I explain below.
How did this happen? Why did the number of deleted documents grow? To understand it, you need to know:

- how ad reindexing works;
- how nested documents are stored;
- how segments are formed and how they are cleaned up.
How ad reindexing works
When a document is updated in Elasticsearch, two operations are actually performed:

- the document in the index is marked as deleted (it is not physically removed, only marked);
- a new document is created with the new data.

A document marked as deleted remains in its segment until the segments in the index are merged. You can read more about it here.
How nested documents are stored
Each object stored in a nested field is a separate document at the Lucene level (documentation).
With this scheme, each ad is 1 document for the ad's own fields plus 120 documents for the booking dates with prices (because we store 120 booking dates). So when we update one ad, the following actually happens:

- 121 documents are marked as deleted (1 ad + 120 booking dates);
- 121 new documents with the new values are created. Even if we update only one booking date, all 121 documents are still reindexed.
How segments are formed and how they are cleaned
Segments are blocks in which documents are stored. They are created as new documents are added. When a segment reaches a certain size, it stops being filled and a new segment is created instead. Once a certain number of segments accumulates, they begin to merge, and it is during a merge that documents marked as deleted are physically removed.
Here you can see a visualization of how segments are formed and merged.
Putting it all together
When we update booking dates or any other data in ads, a huge number of documents get marked as deleted. In daily rental this happens often: users frequently book specific dates, and those dates become unavailable.
Because the flow of ad updates is large, the number of deleted documents in the index is huge, and they are not removed until the segments merge. As a result, search complexity grows and the CPU load increases significantly; the load on the indexing process grows as well.
Therefore, we decided to get rid of the nested field and filter with a Painless script instead.
A new solution
We devised another solution: store an availability mask of booking dates, plus an array of prices, plus the ordinal number of the day the calendar starts from, and filter with a Painless script. The fields look like this:
calendar_date_availability = '001101' # date availability mask, starting from September 10
calendar_price_availability = [0, 0, 1000, 1500, 0, 1000] # price for each date, starting from September 10
booking_start_day = 9 # number of days since September 1, 2022
calendar_date_availability is a text field: 1 means the date is available for booking, 0 means it is not. I will explain below why it is a text field.
calendar_price_availability is an array of numbers; each element is the price for a booking date. The index of an element corresponds to the character position in calendar_date_availability.
booking_start_day is the field that lets the Painless script work out which date the mask calendar_date_availability and the price array calendar_price_availability start from. To compute it, we chose a conditional reference date from which days are counted: September 1, 2022. At indexing time we compute the number of days since this reference date. Then, when executing a search query, the script can work out which date the list of available dates in calendar_date_availability starts from and filter the ads by the desired dates.
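The index arithmetic described above can be sketched in a few lines of Python (EPOCH is the conditional reference date; the variable names are illustrative, not the production ones):

```python
from datetime import date

# Conditional reference date from which days are counted.
EPOCH = date(2022, 9, 1)

def days_from_epoch(d: date) -> int:
    return (d - EPOCH).days

# At indexing time: this ad's calendar starts on September 10, 2022.
booking_start_day = days_from_epoch(date(2022, 9, 10))  # 9

# At query time: the user searches for September 11-12 (check-out on the 13th).
start_days_from_point = days_from_epoch(date(2022, 9, 11))
end_days_from_point = days_from_epoch(date(2022, 9, 12))

# Inside the script these become positions in calendar_date_availability.
start_index = start_days_from_point - booking_start_day  # 1
end_index = end_days_from_point - booking_start_day      # 2
```

With the mask '001101' starting at day 9, positions 1 and 2 are the characters for September 11 and 12.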
Mappings of these fields:
{
  "properties": {
    "props": {
      "properties": {
        "calendar_date_availability": {"type": "text", "fielddata": true},
        "calendar_price_availability": {"type": "integer", "store": true},
        "booking_start_day": {"type": "integer"}
      }
    }
  }
}
The Painless script that filters by these fields looks like this:
// Determine the start and end indexes within which we check date availability for this ad
int start_index = params['start_days_from_point'] - (int) doc['props.booking_start_day'].value;
int end_index = params['end_days_from_point'] - (int) doc['props.booking_start_day'].value;
char[] dates_availability = doc['props.calendar_date_availability'].value.toCharArray();
// Check that availability information covers all requested days
if (dates_availability.length <= end_index) {
    return false;
}
// Check that every date in the requested period is available
for (int i = start_index; i <= end_index; i++) {
    if (dates_availability[i] == (char)'0') {
        return false;
    }
}
// Compute the average price over the period and check whether it falls
// into the requested price range
int min_price = params['min_price'];
int max_price = params['max_price'];
def prices = params._fields['props.calendar_price_availability'].values; // a store field is accessed differently
int total_price = 0;
for (int i = start_index; i <= end_index; i++) {
    total_price += prices[i];
}
int average_price = total_price / (end_index - start_index + 1);
if (
    average_price >= min_price &&
    (average_price <= max_price || max_price == 0)
) {
    return true;
}
return false;
Nuances
We could not use standard mappings for the calendar_date_availability and calendar_price_availability fields. There are several nuances.
Simple arrays do not preserve the order of elements at the search phase (documentation). Therefore calendar_date_availability has type text, and in the script it is expanded into characters. To access the text from Painless, the "fielddata": true flag must be set. Other options we considered:

- Store a boolean array in a field with the stored flag in the mapping. This increases search time, because access to such a field is several times slower than to a regular one.
- Newer versions of Elasticsearch have a dense_vector type that preserves element order, but it is not available in our version, so this option did not suit us. This field also has a fixed length; to grow it, we would have to introduce a new field.
- Store an object with keys '0', '1', '2', and so on. In our case there were 120 booking dates, so the number of fields would be very large, with unknown consequences; moreover, the number of stored days may grow.
The calendar_price_availability field has the "store": true flag in the mapping, which preserves the original order of the elements, but we have to sacrifice speed: with price filtering enabled, the search takes 5-10 times longer. Inside the script, such a field is also accessed differently.
An approach similar to the one used for booking dates (a string type) did not work out: we would have to split the string by a separator, which is itself no faster than the slow access to the store field.
To reduce the impact of slow access to prices through the store field, we made an optimization. At the indexing stage we determine whether the price is the same for all dates, and if so, we record it in a separate field. At the search stage we check whether there is a single price for all dates; if there is, we avoid accessing the store array and filter by the single price only. Since about 90% of ads have the same price for all booking dates, this optimization gives a significant speedup (about 5x) when filtering ads by price.
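The indexing-time side of this optimization can be sketched as follows; the field name calendar_single_price is a hypothetical stand-in for whatever the production field is called:

```python
def build_price_fields(prices):
    # Prices for unavailable dates are stored as 0, so only the non-zero
    # values count when deciding whether the price is uniform.
    real = [p for p in prices if p != 0]
    single = real[0] if real and all(p == real[0] for p in real) else None
    return {
        "calendar_price_availability": prices,  # the slow store field
        "calendar_single_price": single,        # fast path when not None
    }

# One price for all available dates: the search side can skip the store array.
print(build_price_fields([0, 0, 1000, 1000, 0, 1000])["calendar_single_price"])  # 1000

# Prices differ: fall back to reading the store field in the script.
print(build_price_fields([0, 0, 1000, 1500, 0, 1000])["calendar_single_price"])  # None
```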
For unavailable booking dates, the value 0 is stored in the prices array. You should not store null: all null values would be collapsed into one, and the array could end up a different length, making it impossible to match prices to dates by the same indexes.
As a result, we got rid of the problems of the nested-field solution: the huge number of deleted documents in the index disappeared, and the CPU load dropped to almost the same values as before the new filters were added.
Parent / Join was also considered
Parent / Join is a mechanism that allows you to request documents taking into account the relationships between them. It is somewhat similar to a join query in relational databases, but has a number of limitations.
To perform such a search, you specify a join-type field in the index mapping. Child and parent documents live in the same index. In the mapping you declare which relationships documents can have, for example {ad} -> {booking date}. At indexing time you specify the type {ad} for a parent document, and the type {booking date} plus the id of the parent ad for a child document. You can then search and aggregate documents based on this relationship.
For our case, we could create an {ad} -> {booking date} relationship and index the booking dates as separate child documents referencing the parent ad, then search via has_child: request ads that have the desired booking dates with a suitable price on those dates.
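As a hedged illustration only (we did not ship this, so treat it as a sketch of the API shape rather than a tested configuration), the mapping and query could look roughly like this, shown here as Python dicts ready to be sent as request bodies; the relation and field names are our hypothetical choices:

```python
# Mapping fragment: one join field declaring the ad -> booking_date relation.
join_mapping = {
    "properties": {
        "relation": {
            "type": "join",
            "relations": {"ad": "booking_date"},  # parent type -> child type
        }
    }
}

# has_child query: find parent "ad" documents that have a child
# "booking_date" document for the wanted date with a price in range.
has_child_query = {
    "query": {
        "has_child": {
            "type": "booking_date",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"date": "01-09-2022"}},
                        {"range": {"price": {"gte": 1000, "lte": 2000}}},
                    ]
                }
            },
        }
    }
}
```

As with the nested solution, matching a multi-day period would need one has_child clause per night.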
But this approach has several limitations, because of which we decided to abandon it:

- Child documents live in the same index as parent documents, so their ids may collide. We would have to work out how to generate ids for each booking date so that they do not overlap with ad ids.
- Ads and dates must reside on the same shard, so routing must be controlled when indexing booking dates (the routing key is the ad id), and the index must be single-type. We would have to create a new index with single_type = true (this setting means the index can have only one document mapping type; in version 7 and newer this is the only behavior).
- Booking dates must be indexed separately from the ads.
We do not know what results this approach would give, because we did not experiment with it. But there is a concern that it may be less efficient than the nested-field search: nested documents are stored in a single Lucene block, whereas parent/child documents are only guaranteed to be on the same shard, which makes searching them harder (see the discussion on this topic). If you have used parent/child for a similar task, please share your experience in the comments.
Result
We found a solution that is optimal in terms of research time, implementation effort, and benefit. The optimization reduced the load on the search index by about a third, which allowed us to cut the cost of the search cluster. At the same time, search query speed was maintained, and the number of incidents related to sudden spikes in CPU consumption and increased disk reads decreased.