Fast typo correction in Rust


We launched our Hacker News search and RAG engine with a half-baked typo-correction system. Our first version added more than 30 ms of latency to correctly spelled queries, which was enough that we disabled the feature by default. Our newest version is 100x faster: it handles correctly spelled queries in ~300 µs and spends ~5 ms per word on typo correction. In this post, we explain how we achieved this!

Example queries

If you want to experiment with typo correction on the hn.trieve.ai website, click on the following links:

We build a dictionary of words and their frequencies

For small datasets this task is simple: a single worker with basic word-splitting logic can churn through text on the order of 1,000 HN posts in about ten seconds. At the scale of the Hacker News demo dataset (38M+ posts), however, the work has to be distributed.

In the end, we split dictionary building across two separate workers:

1. A cron job scrolls through all of the documents in each of our users' datasets and pushes their chunk IDs from our database onto a Redis queue, 500 at a time.

2. A word worker pops items off the queue and processes 500 chunks at a time. Each chunk's text is pulled, split into words, and each word is then written to ClickHouse.

We decided to store the dictionary in ClickHouse after we started hitting deadlocks and performance problems writing to Postgres, and these problems only got worse as we added more workers. ClickHouse's asynchronous inserts are fantastic for this workload, and we were able to ingest our entire 38-million-document dataset in less than an hour.
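
As a rough sketch of the per-chunk work (not our production worker code), splitting a chunk's text into lowercase words and accumulating counts could look like this; the batching and the ClickHouse write with async inserts enabled would happen around it:

use std::collections::HashMap;

/// Split a chunk's text into lowercase words and count occurrences.
/// The resulting (word, count) pairs are what the word worker would
/// send to ClickHouse in batches.
fn count_words(chunk_text: &str) -> HashMap<String, u64> {
    let mut counts: HashMap<String, u64> = HashMap::new();
    for word in chunk_text
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
    {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = count_words("Show HN: Show HN posts, ranked");
    // e.g. {"show": 2, "hn": 2, "posts": 1, "ranked": 1}
    println!("{counts:?}");
}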

We detect and correct typographical errors using the BK-tree data structure

We took the standard approach to typo correction and built a Burkhard-Keller tree (BK-tree) for each dataset. A BK-tree lets us efficiently compare the words of a search query against a large dictionary, with lookups that run in roughly O(log N) time. A detailed explanation of this data structure is beyond the scope of this post, but you can read about our Rust implementation here or refer to the Wikipedia article.
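
For intuition, here is a minimal, self-contained BK-tree sketch (not our production implementation) that stores words with their frequencies and returns every entry within a maximum Levenshtein distance of a query:

use std::collections::HashMap;

// Classic two-row Levenshtein edit distance.
fn levenshtein(a: &str, b: &str) -> usize {
    let (a, b): (Vec<char>, Vec<char>) = (a.chars().collect(), b.chars().collect());
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j + 1] + 1).min(curr[j] + 1).min(prev[j] + cost);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

struct Node {
    word: String,
    freq: u32,
    // Children are keyed by their edit distance to this node's word.
    children: HashMap<usize, Node>,
}

pub struct BkTree {
    root: Option<Node>,
}

impl BkTree {
    pub fn new() -> Self {
        BkTree { root: None }
    }

    pub fn insert(&mut self, word: &str, freq: u32) {
        let root = self.root.get_or_insert_with(|| Node {
            word: word.to_string(),
            freq: 0,
            children: HashMap::new(),
        });
        let mut node = root;
        loop {
            let d = levenshtein(&node.word, word);
            if d == 0 {
                node.freq += freq;
                return;
            }
            // Descend into (or create) the child at distance `d`.
            node = node.children.entry(d).or_insert_with(|| Node {
                word: word.to_string(),
                freq: 0,
                children: HashMap::new(),
            });
        }
    }

    /// All (word, freq, distance) entries within `max_distance` of `query`.
    pub fn find(&self, query: &str, max_distance: usize) -> Vec<(&str, u32, usize)> {
        let mut results = Vec::new();
        let mut stack: Vec<&Node> = self.root.iter().collect();
        while let Some(node) = stack.pop() {
            let d = levenshtein(&node.word, query);
            if d <= max_distance {
                results.push((node.word.as_str(), node.freq, d));
            }
            // Triangle inequality: only children whose edge distance lies in
            // [d - max_distance, d + max_distance] can contain matches.
            for (&edge, child) in &node.children {
                if edge + max_distance >= d && edge <= d + max_distance {
                    stack.push(child);
                }
            }
        }
        results
    }
}

The production tree is built from the ClickHouse dictionary, and the find used later in this post takes an owned String and yields ((correction, freq), distance) tuples, so treat the signature above as illustrative.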

We build the BK-trees with a third worker, the bktree-worker. It reads the complete word and frequency dictionaries stored in ClickHouse and builds a tree for each dataset.

Once a BK-tree is built, the worker stores it in Redis so that it can be loaded into the API server's memory as soon as the first request for that dataset arrives.

This turned out to be non-trivial for very large datasets, where the tree weighs in at hundreds of megabytes. We were constantly hitting latency while reading from and writing to Redis, so we developed a serialization method that flattens and compresses the data, making it more compact in Redis and reducing pull and push latency.
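
As an illustrative sketch only (the crate choices and helper below are assumptions, not a description of our actual flatten-and-compress format), a serialized tree can be gzip-compressed before being written to Redis:

use std::io::Write;

use flate2::{write::GzEncoder, Compression};

/// Gzip-compress a pre-serialized ("flattened") BK-tree blob and store it
/// in Redis. Loading does the inverse: GET, decompress, deserialize.
fn store_tree_blob(
    con: &mut redis::Connection,
    key: &str,
    serialized_tree: &[u8],
) -> redis::RedisResult<()> {
    // Compress before shipping to Redis to cut transfer and load time.
    let mut encoder = GzEncoder::new(Vec::new(), Compression::default());
    encoder
        .write_all(serialized_tree)
        .expect("writing to an in-memory gzip encoder should not fail");
    let compressed = encoder
        .finish()
        .expect("finishing an in-memory gzip encoder should not fail");
    redis::cmd("SET").arg(key).arg(compressed).query(con)
}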

We write business logic to correct typos

The API server runs our typo_operator, which we have heavily optimized: it processes correctly spelled queries in ~300 µs and spends up to 10 ms per word on queries with typos.

Pulling data from Redis

Pulling a large data structure like the BK-tree for HN from Redis takes around 300 ms. That is not viable to do for every search request, so on the server side we implemented a cache layer that holds BK-trees which have already been pulled once. This is done with lazy_static!:

lazy_static! {
    static ref BKTREE_CACHE: BKTreeCache = BKTreeCache::new();
}

On the first search with typo tolerance enabled for a dataset, we accept a cold start of roughly 200-400 ms while its BK-tree is pulled from Redis into the server's memory. Subsequent searches use the in-memory BK-tree to check for typos, and each check takes only 100-300 µs.
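
A simplified version of that cache layer might look like the following sketch, where BKTreeCache is modeled as an RwLock<HashMap> and load_bktree_from_redis is a hypothetical helper; it only illustrates the get-or-load pattern, not our actual types:

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

use lazy_static::lazy_static;

// Stand-in for the real BK-tree type.
struct BkTree;

// Hypothetical helper that pulls and deserializes a tree from Redis.
fn load_bktree_from_redis(dataset_id: &str) -> BkTree {
    let _ = dataset_id;
    BkTree
}

lazy_static! {
    // dataset_id -> shared, immutable BK-tree.
    static ref BKTREE_CACHE: RwLock<HashMap<String, Arc<BkTree>>> =
        RwLock::new(HashMap::new());
}

/// Return the cached tree if present; otherwise pay the cold-start cost once
/// by loading it from Redis and caching it for subsequent requests.
fn get_bktree(dataset_id: &str) -> Arc<BkTree> {
    if let Some(tree) = BKTREE_CACHE.read().unwrap().get(dataset_id) {
        return Arc::clone(tree);
    }
    let tree = Arc::new(load_bktree_from_redis(dataset_id)); // ~200-400 ms cold start
    BKTREE_CACHE
        .write()
        .unwrap()
        .insert(dataset_id.to_string(), Arc::clone(&tree));
    tree
}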

Identifying English words

1. Preliminary identification of English words

Since our BK-trees are built exclusively from dataset-specific dictionaries, they may not contain every real English word. To avoid needlessly "correcting" real words that simply aren't in the tree, we added a preliminary step that identifies English words:

  • In RAM, we keep a hash set of approximately 400,000 English words, loaded with lazy_static!:

lazy_static! {
    static ref ENGLISH_WORDS: HashSet<String> = {
        include_str!("../words.txt")
            .lines()
            .map(|s| s.to_lowercase())
            .collect()
    };
}

2. Analysis of affixes

Next, we check whether the word is simply an English word with a common prefix or suffix attached:

lazy_static! {
    static ref PREFIX_TRIE: Trie = {
        let prefixes = vec![
            "anti", "auto", "de", "dis", "down", "extra", "hyper", "il", "im", "in", "ir", "inter",
            "mega", "mid", "mis", "non", "over", "out", "post", "pre", "pro", "re", "semi", "sub",
            "super", "tele", "trans", "ultra", "un", "under", "up",
        ];
        Trie::new(&prefixes)
    };
    static ref SUFFIX_TRIE: Trie = {
        let suffixes = vec![
            "able", "al", "ance", "ation", "ative", "ed", "en", "ence", "ent", "er", "es", "est",
            "ful", "ian", "ible", "ic", "ing", "ion", "ious", "ise", "ish", "ism", "ist", "ity",
            "ive", "ize", "less", "ly", "ment", "ness", "or", "ous", "s", "sion", "tion", "ty",
            "y",
        ];
        Trie::new(&suffixes)
    };
}

1. For each word in the query, we search these tries for the longest matching prefix and suffix.

2. We then strip those affixes from the word, leaving only the root.

3. Final dictionary check

After stripping the affixes, we look the remaining root up in the English word corpus:

fn is_likely_english_word(word: &str) -> bool {
    if ENGLISH_WORDS.contains(&word.to_lowercase()) {
        return true;
    }

    // Prefix check
    if let Some(prefix_len) = PREFIX_TRIE.longest_prefix(word) {
        if ENGLISH_WORDS.contains(&word[prefix_len..].to_lowercase()) {
            return true;
        }
    }

    // Suffix check
    if let Some(suffix_len) = SUFFIX_TRIE.longest_suffix(word) {
        if ENGLISH_WORDS.contains(&word[..word.len() - suffix_len].to_lowercase()) {
            return true;
        }
    }

    // Compound word check
    if word.contains('-') {
        let parts: Vec<&str> = word.split('-').collect();
        if parts
            .iter()
            .all(|part| ENGLISH_WORDS.contains(&part.to_lowercase()))
        {
            return true;
        }
    }

    false
}
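
The Trie used above with longest_prefix and longest_suffix is a small in-house helper whose implementation isn't shown here. A minimal sketch of how it could work (assuming the ASCII affixes above, so char counts match byte offsets):

use std::collections::HashMap;

#[derive(Default)]
struct TrieNode {
    children: HashMap<char, TrieNode>,
    is_end: bool,
}

pub struct Trie {
    root: TrieNode,
}

impl Trie {
    pub fn new(words: &[&str]) -> Self {
        let mut root = TrieNode::default();
        for word in words {
            let mut node = &mut root;
            for c in word.chars() {
                node = node.children.entry(c).or_default();
            }
            node.is_end = true;
        }
        Trie { root }
    }

    /// Length (in chars) of the longest stored affix that prefixes `word`.
    pub fn longest_prefix(&self, word: &str) -> Option<usize> {
        let mut node = &self.root;
        let mut best = None;
        for (i, c) in word.chars().enumerate() {
            match node.children.get(&c) {
                Some(next) => {
                    node = next;
                    if node.is_end {
                        best = Some(i + 1);
                    }
                }
                None => break,
            }
        }
        best
    }

    /// Length (in chars) of the longest stored affix that suffixes `word`.
    pub fn longest_suffix(&self, word: &str) -> Option<usize> {
        let chars: Vec<char> = word.chars().collect();
        // Try the longest tail first and stop at the first match.
        for start in 0..chars.len() {
            if self.contains(&chars[start..]) {
                return Some(chars.len() - start);
            }
        }
        None
    }

    fn contains(&self, chars: &[char]) -> bool {
        let mut node = &self.root;
        for c in chars {
            match node.children.get(c) {
                Some(next) => node = next,
                None => return false,
            }
        }
        node.is_end
    }
}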

4. Search for non-dictionary words in BK-trees

If a word fails the check against the English corpus, we fall back to a BK-tree search:

1. We query the BK-tree in search of the closest matching words.

2. We generate a set of “corrected” versions of each non-dictionary word.

let mut best_correction = None;
let mut best_score = 0;

for ((correction, freq), distance) in tree.find(word.to_string(), max_distance) {
    // An exact match means the word already exists in the dataset dictionary;
    // no correction is needed.
    if distance == 0 {
        best_correction = None;
        break;
    }
    if !is_best_correction(word, correction) {
        continue;
    }

    // Prefer closer matches first, then more frequent words in the dataset.
    let score = (max_distance - distance) * 1000 + *freq as isize;

    if score > best_score || best_correction.is_none() {
        best_correction = Some(correction);
        best_score = score;
    }
}

if let Some(correction) = best_correction {
    corrections.insert(word, correction.to_string());
}

5. Selecting the best correction

Once we have compiled a set of possible corrections, we use a scoring algorithm to choose the best one.

The algorithm primarily rewards corrections that share a prefix with the original word, and it also takes the frequency of each candidate word within the dataset into account.

fn is_best_correction(word: &str, correction: &str) -> bool {
    // Length-based filter
    let len_diff = (word.len() as i32 - correction.len() as i32).abs();
    if len_diff > 2 {
        return false;
    }

    // Prefix match (adjust the length if needed)
    let prefix_len = std::cmp::min(1, std::cmp::min(word.len(), correction.len()));
    if word[..prefix_len] != correction[..prefix_len] {
        return false;
    }

    // Character-set comparison
    let word_chars: HashSet<char> = word.chars().collect();
    let correction_chars: HashSet<char> = correction.chars().collect();
    let common_chars = word_chars.intersection(&correction_chars).count();
    let similarity_ratio =
        common_chars as f32 / word_chars.len().max(correction_chars.len()) as f32;

    similarity_ratio >= 0.8
}

Unexpected benefits

HN user Levitating reported that the query FreeBSD sorted by score returned irrelevant results. Our tokenizer splits words on camel case, so FreeBSD is tokenized into Free, BSD, and FreeBSD. Stories that contain only "Free" have much higher scores than any that contain "FreeBSD", so they were ranked higher. Dataset-level whole-word identification lets us fix this: non-English words can automatically be treated as required, exact terms, so the query FreeBSD effectively becomes "FreeBSD".
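
As a rough illustration of the camel-case splitting involved (not our exact tokenizer), a token like FreeBSD can be expanded into its camel-case parts plus the original token:

/// Split a token on camel-case boundaries and keep the original token too,
/// e.g. "FreeBSD" -> ["Free", "BSD", "FreeBSD"]. Illustrative only.
fn camel_case_tokens(token: &str) -> Vec<String> {
    let chars: Vec<char> = token.chars().collect();
    let mut parts: Vec<String> = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        // Start a new part on a lower->upper transition or before an
        // uppercase letter that begins a new lowercase run.
        let prev_lower = i > 0 && chars[i - 1].is_lowercase();
        let next_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
        if c.is_uppercase() && !current.is_empty() && (prev_lower || next_lower) {
            parts.push(std::mem::take(&mut current));
        }
        current.push(c);
    }
    if !current.is_empty() {
        parts.push(current);
    }
    if parts.len() > 1 {
        parts.push(token.to_string());
    }
    parts
}

fn main() {
    assert_eq!(camel_case_tokens("FreeBSD"), vec!["Free", "BSD", "FreeBSD"]);
    assert_eq!(camel_case_tokens("Rust"), vec!["Rust"]);
}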

Ideas for the future

Building on this same system, we plan to implement query splitting and query concatenation, since both operations depend on the same fast dictionary lookups.

