Searching for near-duplicates

Duplicate rows in a DataFrame can cause problems in data analysis.

The easiest way to detect and remove exact duplicate rows is Pandas’ drop_duplicates() method. But how do you find near-duplicates without labeling every pair of texts, while avoiding false positives?
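For exact duplicates the Pandas one-liner is enough. A minimal sketch (the toy data here is illustrative):

```python
import pandas as pd

# Toy dataset with one exact duplicate row (illustrative data).
df = pd.DataFrame({
    "text": [
        "Central bank holds key rate",
        "Central bank holds key rate",   # exact duplicate
        "Oil prices rise on supply cuts",
    ]
})

# drop_duplicates() removes only rows that match exactly;
# near-duplicates (one changed word, digit, etc.) survive it.
deduped = df.drop_duplicates(subset="text").reset_index(drop=True)
print(len(deduped))  # → 2
```

Anything that is not a byte-for-byte match slips through, which is exactly the gap the rest of this post is about.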

We needed an ML algorithm that scales well and handles borderline cases, for example when the only difference between two texts is a single digit.

I work on natural language processing tasks at Gazprombank. In this post, together with DVAMM, we will describe the deduplication methods we use and the problems we ran into in practice when detecting near-duplicates.

Semantic Text Similarity vs. Near-Duplicate Detection

Any near-duplicate search task is also an STS (semantic text similarity) task. But not every STS task is a near-duplicate detection task.

What is the difference?

STS is the search for pairs of texts that are similar in meaning. They may be structured differently and differ in vocabulary, but they share a common topic.

Near-duplicate detection, on the other hand, is the search for texts that duplicate each other up to minor changes. But what counts as a minor change? The notion is rather vague, and it is a setting the developer has to define together with the business for each specific task. In what follows, “duplicate” means a near-duplicate.

So the tasks are similar, but methods that work for STS produce a noticeably high number of false positives in the deduplication task.

We need near-duplicate search for:

  • Deduplication of datasets.
  • Deduplication of model output.
  • Deduplication of a huge array of texts for LLM training.

Dataset deduplication

Let’s look at an example. We have a dataset in which there are duplicates (three bars of the same color).

Following the classic recipe, we randomly split the dataset into training and validation sets and find that two of the duplicates landed in the training set. This can bias the model towards these duplicates: during training the model effectively sees these examples more often than others, as if we had given them extra weight in the loss function. As a result, generalization is likely to suffer.

We can also see that one of the duplicates ended up in the validation set. This means we will partly evaluate the model on (almost) the same data it was trained on, which is incorrect and inflates the quality metrics.

Deduplication of model output

Let’s imagine that we have a stream of news that we run through a machine learning model. As a result, we get a stream sorted, for example, by relevance.

This is the task of ranking the news feed, that is, selecting the most important news.

There are also duplicates in the source data that need to be removed to make the output of the model unique to the user.

So we can run this stream through another model that keeps only the unique news items (two in our example). This is, in fact, one of the main deduplication applications we are actively working on.

Deduplication for LLM

This is deduplication of the huge text corpora on which GPT-like models are trained, corpora that often contain a large number of near-duplicates.

Removing these duplicates will save resources and reduce training time for large language models.

It should also be taken into account that an LLM trained on data with many duplicates is more likely to reproduce its training data almost verbatim in its answers.

Below is a graph from a recent Google Research paper showing that if you sample text from such a model without any prompt, about 40% of the time it reproduces verbatim text from the data the model was trained on.

So if we want fewer verbatim repetitions, the model has to be trained on a deduplicated dataset.

What are the methods for solving this task?

The first method

The first method is to score a pair of texts for similarity using the Jaccard index, which is essentially the overlap between the word sets of the two texts.

We take the words that occur in both texts A and B and divide their count by the total number of unique words across the two texts. The result is a metric J between zero and one: the closer the index is to one, the more similar the texts.

In cases where text A is fully contained in text B (all of A is in B, plus some extra information), the metric can be modified to handle this correctly: instead of dividing by the size of the union, divide the intersection by the size of the smaller word set.
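Both variants fit in a few lines. A minimal sketch over word sets (the example texts are made up):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index over word sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def overlap(a: str, b: str) -> float:
    """Containment variant: |A ∩ B| / min(|A|, |B|).

    Divides by the size of the smaller set, so a short text fully
    contained in a longer one still scores 1.0.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / min(len(sa), len(sb))

short = "rates unchanged"
long_ = "rates unchanged after the board meeting on friday"
print(jaccard(short, long_))  # → 0.25: union is large, index is low
print(overlap(short, long_))  # → 1.0: short text fully contained in long
```

The containment variant is the standard overlap coefficient; which of the two to use depends on whether "A is a subset of B" should count as a duplicate in your task.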

Advantages of this method: simplicity and intuitiveness. Drawback: it does not scale to a large number of texts because of quadratic time complexity, since every pair of texts has to be compared.

The second method is based on hashing. It is a slightly more advanced approach, which we will walk through using Locality-Sensitive Hashing (LSH) as an example.

Let’s imagine that we have two duplicate documents. We pass them through shingling, i.e. we split each text into tokens: word n-grams or shingles (character n-grams).

After splitting both texts, we feed them into the MinHash hashing scheme, which tries to maximize the number of hash collisions between similar texts.

Here the first hyperparameter appears: num_perm, the length of the resulting hash signature (i.e. the number of hash functions applied to each text). After hashing both texts we get, say, six-element signature vectors.

Then we introduce one more hyperparameter, band_size, and set it, for example, to two. We split each signature into bands of this length, so each six-element vector is divided into three bands.

If at least one band matches, the pair is flagged as a candidate duplicate. In our example the last band matches, so the pair is probably a duplicate.

In the book “Mining of Massive Datasets” it is shown mathematically that the Jaccard index computed on the MinHash signatures approximates its true value on the original texts. So later we can compute this approximate value and filter our output from the random garbage that arises from the randomness of the algorithm’s hash functions.
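The whole pipeline — shingling, MinHash signatures, banding, and the signature-based Jaccard estimate — can be sketched with the standard library alone (production systems would use a library such as datasketch; the news texts below are made up):

```python
import hashlib

def shingles(text: str, k: int = 3) -> set:
    # Character k-grams ("shingles") of the text.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash(sh: set, num_perm: int = 32) -> list:
    # One salted hash function per "permutation"; keep the minimum
    # hash value over all shingles for each of the num_perm seeds.
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh))
    return sig

def bands(sig, band_size: int = 4):
    # Split the signature into fixed-size bands; two texts become
    # duplicate candidates if at least one band matches exactly.
    return [tuple(sig[i:i + band_size]) for i in range(0, len(sig), band_size)]

t1 = "the central bank kept the key rate at 16 percent"
t2 = "the central bank kept the key rate at 16 per cent"
s1, s2 = minhash(shingles(t1)), minhash(shingles(t2))

# Candidate pair if any band coincides.
is_candidate = any(b1 == b2 for b1, b2 in zip(bands(s1), bands(s2)))

# The fraction of matching signature positions approximates
# the true Jaccard index of the shingle sets.
est_jaccard = sum(a == b for a, b in zip(s1, s2)) / len(s1)
print(is_candidate, round(est_jaccard, 2))
```

With num_perm = 32 and band_size = 4 we get 8 bands; the more similar the texts, the higher the probability that some band coincides, which is what lets LSH skip the quadratic all-pairs comparison.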

Problems we faced in practice

The first problem is the difficulty of evaluating this algorithm.

We have a dataset with some labeled pairs.

The solid lines are duplicate pairs present in our labeling; the dashed lines are duplicate pairs missing from it. The labeling is incomplete because analyzing and labeling all pairs among a large number of texts is very time-consuming and expensive, so it is rational to label only a subsample of texts.

Besides pairs 1-2 and 2-3, LSH will also find and return other pairs: 1-3, 1-4, 4-5, 1-5. Pair 1-3 can be recovered through neighbors in the implied graph, so it causes no problems. But the other pairs will be counted as false positives when computing metrics, simply because they are absent from our labeling.

It turns out that to evaluate such algorithms correctly we would need to label all pairs of texts, which is not practical in an industrial setting.

The second problem is that in some borderline cases both LSH and the Jaccard index do not work quite correctly, and there is no naive way to fix this.

For example, two short news items differ only by a single digit. The Jaccard index is quite high, so we would most likely classify the pair as duplicates. But in some applications that digit can be critical. It follows that such cases cannot be handled by either the Jaccard index or hashing methods.
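It is easy to see why set-overlap metrics are blind to this. A sketch with made-up headlines that differ only in one digit:

```python
def jaccard(a: str, b: str) -> float:
    # Jaccard index over word sets.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

# Two headlines identical except for a single, business-critical digit.
n1 = "company reports quarterly profit of 5 billion"
n2 = "company reports quarterly profit of 6 billion"
print(round(jaccard(n1, n2), 2))  # → 0.75: well into "duplicate" territory
```

Seven words, six shared: the index is 0.75, and no threshold on it can separate "same news, typo fixed" from "different profit figure" — the metric simply does not see which token changed.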

As we can see, LSH is an excellent scalable algorithm, configurable via hyperparameters and parallelizable on a CPU. But it is hard to evaluate on incomplete labeling, and it cannot handle borderline cases correctly.

What to do?

We can move to a more sophisticated approach: use LSH as a first-stage (candidate) model to extract candidate pairs, accepting some noise and garbage, and then score those pairs with a second, smarter model trained with a pairwise loss. Operating on both texts at once, such a model can extract pairwise features and separate hard duplicate cases from non-duplicates.

This approach should solve all the problems of the basic methods: with labeled data we can bake corner-case handling into the model, and we can also evaluate it on incomplete labeling by scoring only the labeled pairs.

This approach scales well; it is somewhat slower than plain LSH, but model quality is significantly better. We have not yet tested this in practice ourselves, but the paper “Noise-Robust De-Duplication at Scale” reports a large gap between the model-based approach and pure LSH.
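The two-stage shape of the pipeline can be sketched as follows. Both stages here are stand-ins: a cheap word-overlap filter plays the role of LSH, and difflib’s SequenceMatcher plays the role of the trained pairwise model (in a real system the second stage would be an ML model trained on labeled pairs):

```python
from difflib import SequenceMatcher
from itertools import combinations

def candidate_pairs(texts, threshold=0.5):
    # Stage 1 stand-in for LSH: cheap word-overlap filter.
    # (A real system uses MinHash LSH here to avoid the quadratic scan.)
    pairs = []
    for i, j in combinations(range(len(texts)), 2):
        a, b = set(texts[i].split()), set(texts[j].split())
        if len(a & b) / len(a | b) > threshold:
            pairs.append((i, j))
    return pairs

def pairwise_score(a: str, b: str) -> float:
    # Stage 2 stand-in for the pairwise ML model: a character-level
    # similarity that sees *which* characters differ, not just word sets.
    return SequenceMatcher(None, a, b).ratio()

texts = [
    "bank raises key rate to 16 percent",        # made-up headlines
    "bank raises key rate to 16 percent today",
    "oil output cut extended through march",
]
cands = candidate_pairs(texts)
duplicates = [(i, j) for i, j in cands if pairwise_score(texts[i], texts[j]) > 0.8]
print(duplicates)  # → [(0, 1)]
```

The key property is that the expensive second stage only ever sees pairs the cheap first stage produced, so total cost stays close to LSH while hard decisions are delegated to the smarter model.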

But there is a catch: you need high-quality labeled data, and producing it is slow and difficult, especially for long texts.


Searching for near-duplicates is a separate task, distinct from semantic text similarity, and the approaches to solving them are quite different.

The LSH algorithm scales well and can be used to deduplicate the huge text corpora used to train large language models such as GPT, where deduplication quality as such is not critical.

In turn, when deduplicating output shown to the user, where text volumes are usually small and high quality is required, it makes sense to use the LSH + ML-model combination. It scales well and delivers high quality, providing the user with a wealth of unique information.
