Short description
The author of this article explains how to speed up binary search and look up data in plain text files without relying on traditional databases. The article covers optimization methods, including caching and avoiding unnecessary file reads, provides code examples, and compares the speed of traditional databases with a search system built on these techniques. The author created a Telegram bot that searches data from various sources using this system.
How to speed up binary search
Welcome to the Habr community. I decided to talk about how to speed up ordinary binary search by hundreds of times and how to search for data in a plain text file FASTER than with classic databases. I will try to solve the search problem without them, describe the main optimization methods, and make a comparison at the end. This is a very real task that I faced while developing my own project, so I have something to tell you.
Binary search itself is organized very simply: you sort the data array alphabetically and then look for the information you need the same way you would in a paper dictionary: open the book in the middle and check whether the word you are looking for lies to the right or to the left. Then open the needed half in the middle and again see which way to move, and so on until you find the right word.
Thus, it is possible to find the necessary information in logarithmic time, but only in theory.
def binary_search(arr, x):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == x:
            return mid
        elif arr[mid] < x:
            left = mid + 1
        else:
            right = mid - 1
    return -1

arr = [1, 3, 5, 7, 9]
x = 5
print(binary_search(arr, x))
>> 2
In reality, we usually do not have the entire data array at hand. Binary search is most useful when there is a lot of data and it does not fit into RAM. And even if it did fit, it would still have to be read from disk first, which takes a long time. Fortunately, most programming languages have a wonderful seek() function that lets you jump to any position in the file being read, and the jump time does not depend on how far you move (as long as the file is not fragmented). This means we can run a binary search inside the file without reading it entirely.
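As a minimal illustration (using the same placeholder file text.txt as in the snippets below), we can ask the operating system for the file size and jump straight to the middle without reading anything before that point:
import os

size = os.path.getsize('text.txt')        # total file size in bytes, no reading required
with open('text.txt', encoding='utf-8') as file:
    file.seek(size // 2)                  # jump straight to the middle of the file
    print(file.read(80))                  # read a small chunk from that position
                                          # (may raise UnicodeDecodeError if we land
                                          # inside a multi-byte character, see below)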
But a question may arise: for binary search you need to jump to a line with a specific number. If all lines are the same length, that is a purely arithmetic problem. But what if they are not? Should we pad them to equal length with spaces? Store the line lengths in a separate index? Of course not.
When you look up a word in a dictionary, you do not need to open it to an exact page; it is enough to open it roughly in the middle. Likewise, if we jump to the middle of the file, it does not matter that there are 995 lines to the left and 1005 to the right. It is enough to know which direction to move in, and we will not miss the right line.
Another question may come up: suppose we have jumped to the middle of the file. How do we read a whole line? We have most likely landed somewhere in the middle of one and do not know where it begins.
The simplest solution is to move one character to the left until we hit a newline character, and then read the whole line with readline().
n = 10000000
with open('text.txt', encoding='utf-8') as file:
    file.seek(n)
    while n > 0:
        n = n - 1
        file.seek(n)              # step one character back
        char = file.read(1)
        if char == '\n':          # we reached the end of the previous line
            break
    if n == 0:
        file.seek(0)              # we backed up all the way to the start of the file
    S = file.readline()
    print(S)
However, this solution has a subtle problem: HDDs really do not like reading "backwards", so it will be slower on long lines. It is much simpler to do this:
n = 10000000
with open('text.txt', encoding='utf-8') as file:
    file.seek(n)
    file.readline()           # skip the (possibly partial) line we landed in
    print(file.readline())    # and work with the next full line
Yes, we can simply skip the line we landed in and work with the next one. True, binary search will then never find the very first line of the file, but we can put the column headers there and say that this is not a bug, but a feature. One bug, however, does need fixing:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 0: invalid start byte
The thing is that we land on a random byte of the file. Some Unicode characters occupy several bytes, and if we land "inside" such a character, we get an error. This happens quite rarely, since most characters are single-byte, so the problem can simply be ignored. To do that, we can pass the errors="ignore" parameter to the file-opening function, or use a try/except statement and a bit of code. Something like this:
n = 10000000
with open('text.txt', encoding='utf-8') as file:
    file.seek(n)
    while True:
        try:
            file.readline()       # try to skip the partial line
            break
        except UnicodeDecodeError:
            n = n + 1             # we landed inside a multi-byte character,
            file.seek(n)          # shift one byte forward and try again
    print(file.readline())
How can the search be sped up further? Let's look at which parts of the file we actually read. On every search we read the line at the center of the file. In half of the cases we read the lines at 25% and 75% of the file, and so on. The same positions get read over and over, and that is bad. If we use an SSD, this approach will wear it out over time; if we use an HDD, moving from one position to another takes relatively long, because the read head has to physically travel to the right place.
But why re-read the center of the file and the other key points on every search? We can read them once and remember them. This optimization technique is called caching, and it can be implemented with the lru_cache decorator.
from functools import lru_cache

@lru_cache(maxsize=100000)
def get_data_in_position(file_name, point):
    with open(file_name, encoding='utf-8') as file:
        file.seek(point)
        while True:
            try:
                file.readline()
                break
            except UnicodeDecodeError:
                point = point + 1
                file.seek(point)
        return file.readline()
Here the maxsize argument sets the maximum number of entries stored in the cache; after all, we don't want it to eat up all the RAM.
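A quick way to check that the cache actually works (the file name hashes.csv and the offset here are just example values): a repeated call with the same arguments is served from memory instead of touching the disk.
print(get_data_in_position('hashes.csv', 10000000))
print(get_data_in_position('hashes.csv', 10000000))   # second call hits the cache, no disk access
print(get_data_in_position.cache_info())              # e.g. CacheInfo(hits=1, misses=1, maxsize=100000, currsize=1)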
But caching does not end there. Right now, every iteration of the search opens the source file again, which takes a noticeable amount of time. Why not cache the open file objects in the same way? Here, for variety, we will do it without ready-made solutions, to show that caching is not difficult at all.
open_file_list = {}  # cache of open file objects

def open_file_cache(name_file):
    global open_file_list
    if open_file_list.get(name_file) is not None:
        return open_file_list.get(name_file)
    else:
        t = open(name_file, 'r', encoding='utf-8', errors="ignore")
        if len(open_file_list) < 10:            # cache up to 10 open files
            open_file_list[name_file] = t
        else:
            File = open_file_list.popitem()     # evict one cached file
            File[1].close()                     # and close it
            open_file_list[name_file] = t
        return t
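To tie the pieces together, here is a simplified sketch of what such a search function can look like. It works directly with byte offsets and, for clarity, leaves out the caching layer described above; the ':' delimiter and the line layout are assumptions for the example, not the exact format of my data.
import os

def find_in_file(file_name, key, delimiter=':'):
    # Binary search over byte offsets in a text file sorted by the key
    # at the start of each line. Simplified sketch without caching.
    left, right = 0, os.path.getsize(file_name)
    with open(file_name, encoding='utf-8', errors='ignore') as file:
        while right - left > 1:
            mid = (left + right) // 2
            file.seek(mid)
            file.readline()                      # skip the partial line we landed in
            line = file.readline()
            if line == '':                       # we are inside the last line, move left
                right = mid
                continue
            current = line.split(delimiter)[0]
            if current == key:
                return line.rstrip('\n')
            elif current < key:
                left = mid                       # the key is further to the right
            else:
                right = mid                      # the key is further to the left
    return None
Now let's compare this approach with a classic database. First, the same lookups through SQLite: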
import sqlite3, random, hashlib, time

# open a connection to the database
conn = sqlite3.connect('hashes.db')
cursor = conn.cursor()

# generate 100,000 random numbers and look up their hashes
total_time = 0
for i in range(100000):
    # pick a random number
    num = random.randint(1, 1000000)
    # compute its hash
    hash_value = hashlib.md5(str(num).encode()).hexdigest()
    # start the timer right before the query
    start_time = time.time()
    # run the query by hash
    cursor.execute('''SELECT id FROM hashes
                      WHERE hash_value = ?''', (hash_value,))
    # fetch the result and add the query time to the total
    result = cursor.fetchone()
    end_time = time.time()
    total_time += end_time - start_time

# close the connection
conn.close()
# print the total query time
print(f'Total time taken: {total_time} seconds')
def find_hashes_csv():
    # generate 100,000 random numbers and look up their hashes
    total_time = 0
    for i in range(100000):
        # pick a random number
        num = random.randint(1, 1000000)
        # compute its hash
        hash_value = hashlib.md5(str(num).encode()).hexdigest()
        # start the timer right before the lookup
        start_time = time.time()
        find_in_file('hashes.csv', hash_value)   # our search function
        end_time = time.time()
        total_time += end_time - start_time
    # print the total lookup time
    print(f'Total time taken: {total_time} seconds')
Results:
So we have cobbled together an improvised search system that is twice as fast as real databases, and it is even more economical in terms of memory!
There is a simple explanation for this: databases are a multitool, a universal instrument with dozens of capabilities, different kinds of searches, and complex algorithms for adding and deleting data. Their developers sacrificed some speed in favor of convenience and versatility, so a simpler algorithm may well turn out to be faster.
And so that you can see that this really works and that I did not pull the numbers out of thin air, I will tell you about the project in which I actually applied everything described above. I created a Telegram bot that searches for data from various sources. Such a bot is useful for exposing fraudsters and hackers, checking your own data for leaks, and so on. Initially I developed it for companies, to check user accounts in bulk, identify those vulnerable to hacking because of leaks, and force them to change their passwords, but now I have made it available to everyone.
When I was implementing the search, an SQL database did not suit me, and not because I dislike SQL: the total size of all the files is 2.5 TB, and together with the indexes they simply did not fit on my disk. So I studied everything I have written about above and, as a result, implemented the bot the way I originally wanted.
This bot can perform 20 queries per second on a database of 40 billion rows! And all of this runs exclusively on binary search: no SQL, just plain text files and maximum optimization, not only the techniques mentioned in this article but also some others that I will write about in future articles.
And here is this bot, you can check how it works: DataLeakAlertBot
I hope you found this article interesting. I am ready to answer any questions you may have.