How to find similar pictures

Short description

Self-filling websites have become popular with the rise of Web 2.0, but this has also led to the problem of duplicate content. While text can be easily searched through full-text search engines or plagiarism detectors, duplicate pictures can be more difficult to deal with. However, there are algorithms that can create hashes or digital fingerprints of pictures and compare them for duplicates. By using a perceptual hash or difference hash, pictures can be reduced to a short string of bits for easier storage and comparison. Vptrees can also be used to create an index of pictures for faster searching.

How to find similar pictures

Web 2.0 is a wonderful thing. Self-service sites. Users fill them with their own hands (“post content”, as they say now). They attacked themselves, laughed themselves. And the owner of the site only pays for hosting and cuts coupons on advertising. Easy.

But our life is so strangely arranged that there are no pluses without minuses, and often the disadvantages are generally a continuation of the advantages. There are problems with self-filling sites, too. In the sense of double.

Many visitors don’t like Dubli, especially the old-timers, who remember by the teeth the memasiks that appeared during the time of Preved-Bedmed and the Alban Yazigg. They meet each new appearance with snorts and threats to immediately unsubscribe.

What to do? Of course, call the iron machine for help – let it search for the accordion itself.

With the text, the procedure is more or less mastered. A simple full-text search engine can be found in almost any DBMS. If someone needs something more complex, for example, taking into account the morphology of the language, Sphinx, Solr and so on are at your service. Even if your queries go so far that you want to look for not just duplicates, but remakes of bearded anecdotes in which Vasyl Ivanovich and Petka are replaced by Darth Vader and Harry Potter in the spirit of the times, at your service td/idf, cosines of the angle between vectors and other methods of finding plagiarism.

The text has been dealt with. But what to do with pictures? After all, this is probably somehow possible? Google searches for similar ones, albeit lousy. Yandex searches, and even better than Google. But what about Yandex — some entertainment sites like Pikabu also have debayanization tools.

You can. And not very difficult, surprisingly, because smart people have already done the most difficult part of the work for us, inventing and testing beautiful and witty algorithms.

How to search for duplicate pictures? Very easy.

Step 1. For each picture, create its digital fingerprint (fingerprinting) – a sequence of bits that reflects its content.

Step 2. Create a print for the picture and compare it with those already in our database.

Voila! And he, as usual, lives already in the details.

Do it once: we create a print

Common sense suggests that the print should not be too long – the shorter the better. After all, it must be stored in the database itself, and then also compared with others. If it is the size of the image itself, it will be difficult to do both.

So, we need some kind of hash function that will collapse the contents of the picture into a short string of bits.

Traditional hash functions such as the sha family are, of course, no longer available. They are too sensitive – even if the binary content of the two images differs by just one bit, the calculated hashes will have nothing in common at all (although, oddly enough, it is sha1 that is used in the Wikipedia engine. We could invent a better one).

There is nothing more to say about cryptographic hashes. They react just as nervously to an extra bit, only at the same time they are considered even more tense.

We need something that would give similar hashes for similar pictures, and still have similarities from a human perspective, not a machine. It was not the bitmaps that were similar to each other, but the picture itself. Here is a chair, and there is a chair, here is the sun, and there it is.

And such hashes are really invented. It is possible, for example, to use machine learning methods to recognize what we have drawn – let’s say, a dog, to recognize what breed it is, what color it is and in what pose it is standing, and then encode these features in the hash.

But we don’t really need such searches, we can do it a little easier. After all, we do not need to classify what is drawn, we just need to find duplicates.

Therefore, our technique will be as follows: the image is decolorized, that is, it is translated into 50 64 shades of gray and is greatly reduced. The first operation makes similar pictures that differ only in colors, the second greatly reduces the volume of information for processing and removes secondary details. Then the resulting picture is considered a hash.

You can, for example, calculate the average color over the entire image and set a value of 0 or 1 for each pixel – darker or lighter than the average. It’s quick and easy, but the results are pretty mediocre.

It is much better to use a discrete cosine transformation, resulting in a so-called perceptual hash (perceptual hash, phash). You don’t even need to write the algorithm yourself, phash is already included in the most popular image processing library — ImageMagick.

Another option is a difference hash (difference hash, dhash). It gives results a little worse than perceptive, but quite acceptable, and it is considered simpler and faster. The idea is the same, but at the same time, the difference between neighboring pixels is encoded in bits – whether it is positive or negative.

It is possible, of course, to invent something else in the same way, but is it necessary?

Do two: compare hashes

It’s very easy. In common sense, we use the Hamming distance, and in simple terms, we count how many bits differ in two hashes.

If, for example, you use MySQL or its little sister Maria, this can be done using a built-in function:

SELECT * FROM images WHERE BIT_COUNT(newMem ^ hash) <= 6

newMem here – dhash (or phash) of the new picture, duplicates of which we want to find, hash – accordingly, the hash of the picture that is already in the database. The cap symbolizes a bitwise XOR, and the number 6 is obtained empirically.

Here is an example of the original image for clarity.

Idyllic scene on the beach of Jemet

And here are its various distortions and perversions, with the Hamming distances calculated for them (the distance is given for the original pictures with a size of 3000×2000, I compressed them for placement on the website):

We reduce the brightness – 1
We take all colors – 1
We make the colors acidic – 0
Cut the right edge a little – 6
Lightly cut from all sides – 1
We add the inscription – 2
We turn up the sharpness – 1
We make a mirror – 37

As you can see, the hash (in this case – dhash) copes well with small changes in the original image – those that come from light editing, of course, and not from deliberate attempts to cheat the algorithm. He didn’t pull only a mirror flip of the image, but no one prevents us from taking a questionable picture, mirroring it and running it through the base again.

If your site gathers users who post on it their lovingly collected photos of old trams with bugles, Russian stoves, lifeboats on the beach, lemon trees grown with their own hands and similar fascinating things, that is, serious, solid people, not teenagers, from the sick vanities that wanted to hack the system, you don’t need more than dhash.

Do two: compare hashes for real

Obviously, the SQL query shown above is here for blazer. If you need to compare a hundred hashes, it is fine, but it will probably think twice about thousands. Why is clear: it compares the test hash with all the others one by one, and because of the function call, it cannot use the indexes of the base.

The situation is further aggravated by the fact that the algorithm for calculating hashes is slightly non-deterministic and even for the same pictures can give slightly different results – only by one or two bits, but this is enough that even a search for exactly the same pictures cannot be carried out using a simple query on equality

And with indices, too, everything is unclear – we need one that shows the Hamming distance from… From what? So there it is. Obviously, something tricky is needed here.

Such indexes, or rather trees, exist in nature. For example, there is such an exotic as vptree (vantage point tree), which can be artistically translated as “a tree with points of convenient observation of other points.” It is easy to understand its principle – it is an ordinary binary tree, only the points in it are grouped according to their distance from each other. Building a tree and searching through it is not so fun – it’s not easy to get into the algorithm the first time, but it’s not the biggest problem. It is much worse that such a tree is built once and when new points are added or old ones are removed, it has to be rebuilt more or less completely. This means that it is not very suitable for use on a site where new material is constantly added and old material is often removed.

And how does Google deal with this? Oh, Google, or rather, Moses Charikar (Moses Charikar) worked for it, invented a very sophisticated thing called Simhash, which is much simpler than tangled spatial trees. Here’s a look:

Let’s assume for concreteness that our hash consists of 64 bits. Let’s divide them into 8 bytes.

We want to find hashes that are no more than 6 bits away from the tested one. But this means that even if all 6 bits are in different bytes of the hash, no matter what, some two bytes will completely match in both hashes.

There they are – two matching bytes

And this means that we can pre-select from the database those hashes that match our two bytes that are being checked, and calculate the distance only for them. This decently reduces the amount of work – by 200-300 times.

How to do it?

According to the original idea, it is necessary to create additional tables for hashes, the number of which is equal to the number of possible combinations of two bytes out of eight, in our case = 28. The same hashes are stored in each table, but with rearranged bytes. In the main one, the first and second bytes are the first and second bytes of the hash, in the first additional one, the first and third, in the second one, the first and fourth, and so on, until we go through all the combinations. Here, in case you are too lazy to consider them yourself.

{1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6}, {1, 7}, {1, 8}, {2, 3}, {2, 4}, {2, 5}, {2, 6}, {2, 7}, {2, 8}, {3, 4}, {3, 5}, {3, 6}, {3, 7}, {3, 8}, {4, 5}, {4, 6}, {4, 7}, {4, 8}, {5, 6}, {5, 7}, {5, 8}, {6, 7}, {6, 8}, {7, 8}

And now we will sort these tables by the first two bytes and can use binary search to select candidates for further comparison!

However, here we can evade the advice of the respected author. After all, we don’t have Google, but the site is simpler, we store much fewer pictures, and therefore we don’t need to invent our own turbocharged bikes. After all, we already have a database – we keep the links and attributes of our photos somewhere. Let’s use it and put the same hash broken into bytes next to the hash in the form of a binary string.

CREATE TABLE images (
  name VARCHAR(100) NOT NULL,
  hash BIGINT UNSIGNED NOT NULL,
  b1 TINYINT UNSIGNED NOT NULL,
  ...
  b8 TINYINT UNSIGNED NOT NULL,
)

Naturally, we will add a separate index to each byte, and then we will search for our hash with the following query, which goes through all possible combinations of two bytes out of eight:

SELECT name FROM images WHERE (
  (b1=newMem_b1 AND b2=newMem_b2) OR
  (b1=newMem_b1 AND b3=newMem_b3) OR
  ...
  (b7=newMem_b7 AND b8=newMem_b8)
) 
AND BIT_COUNT(newMem ^ hash) <= 6

The request for a person is, of course, long, but the database looks at it differently, and it is not difficult for it to optimize it, using indexes on fields b1 … b8.

Combinations 2 to 8, of course, are not written on tablets, this is only one of the options, although it is practically convenient. Depending on the expected number of stored images and the Hamming distance we want to search, it is possible to split the hash into a different number of parts, not necessarily even the same length.

Few practical figures. On my old 10-year-old machine, generating a hash in nodejs takes about a second and a half, of which a second is spent reducing the image using ImageMagick.

Searching for one hash in a database of 30,000 photos (I couldn’t find more cats and figure girls) takes about two seconds, of which a second is again used to reduce the original image. This is without indexes – I was too lazy to create them for the prototype – so the search scaling section is there. Whether it will be enough for a million or two photos is an open question. If you have so many different photos, it would be interesting to know your impressions.

And finally, one more question. What if we want to find more distant photos in the Hamming sense, which differ not by 6 bits, but by 20 or 30? I answer: this is a completely different task — if in English, then not near duplicate search, but similarity detection, and a completely different story, for which the described methods are not suitable and something else must be invented.

If you look events in Mykolaiv – https://city-afisha.com/afisha/

Related posts