Automatic image mining
In previous articles, we talked about how to create a photo gallery with your own search engine [1,2]¹. But where do we find images for the gallery? We would have to manually search for sources of "good" images and then manually verify that each image is "good". Can both of these tasks be automated? The answer is yes.
Data source
Reddit is a social network that combines the features of a forum and a link aggregator, and it is among the top 20 most visited sites in the world according to SimilarWeb. Its most important properties for us: there is a lot of original content, it is generated and curated by users, and, most importantly, the content is easy to parse: there is an API for it, and the site is indexed by search engines.
Reddit data has long been used to build various datasets for machine learning. This free ride will soon be over, but for now it is still possible to access a huge amount of information for free, although it has become much harder. Previously, Pushshift was used to bypass various limits. Pushshift is a Reddit archiving project that provided an API and advanced search (based on Elasticsearch) for free and without restrictions. Unfortunately, Reddit banned it for violating the terms of service.
We will have to find some workarounds. The Reddit API has the following endpoint:
https://www.reddit.com/r/{subreddit_name}/new.json?sort=new&limit=100
which allows you to retrieve up to 100 recent posts from a specific subreddit. According to various estimates, there are between 130 thousand and 3 million subreddits on Reddit. No one will let us make that many requests, so we have to look further. There is a special feed, /r/all, which aggregates posts from all over Reddit; that is where we will download the pictures from.
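As a sketch, fetching a page of fresh posts boils down to one GET request. The User-Agent string here is an assumption (Reddit rejects clients without one), and filtering by file extension is a simplification:

```python
import requests

def fetch_new_image_urls(subreddit: str = "all", limit: int = 100) -> list[str]:
    """Fetch up to `limit` recent posts and keep only direct image links."""
    resp = requests.get(
        f"https://www.reddit.com/r/{subreddit}/new.json",
        params={"sort": "new", "limit": limit},
        headers={"User-Agent": "image-miner/0.1"},  # assumed; any descriptive UA works
        timeout=30,
    )
    resp.raise_for_status()
    posts = resp.json()["data"]["children"]
    return [
        p["data"]["url"]
        for p in posts
        if p["data"]["url"].endswith((".jpg", ".jpeg", ".png"))
    ]
```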
Outlier detection
Next, we need to determine whether an image is suitable. I had about 10k images left over from the previous version of the photo gallery, all of which can be considered good. Anomaly detection methods will help us here. These algorithms work with vectors of object features, so features first need to be extracted from each image. For this we use CLIP ViT-B/16, since we have already used this model to find similar images.
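Feature extraction takes only a few lines; a minimal sketch, assuming the original `clip` package from OpenAI and a local photo.jpg:

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/16", device="cpu")

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # (1, 3, 224, 224)
with torch.no_grad():
    features = model.encode_image(image)  # (1, 512) feature vector per image
```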
We have no examples of "bad" images, so we use methods that do not require labels: unsupervised anomaly detection.
There is a great library that implements many modern anomaly detection algorithms: PyOD. The developers of this library recently published a paper in which they benchmarked 30 algorithms on different datasets.
The result: none of the unsupervised methods is statistically better than the others.
Since the unsupervised methods all perform at roughly the same level, I decided to use the fastest and best documented one. The choice fell on Gaussian Mixture Models.
Gaussian Mixture Models
We use the implementation from scikit-learn. To choose n_components, you either need some a priori knowledge of the distribution (e.g., if you know for sure there are 3 clusters, use n_components=3) or an information criterion such as AIC or BIC: we pick the n_components that minimizes AIC or BIC.
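A minimal sketch of such a sweep, assuming `features` holds the CLIP embeddings of the good images (the candidate range is illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_n_components(features, candidates=range(2, 33)):
    """Fit one GMM per candidate and return the n_components with the lowest AIC."""
    best_n, best_aic = None, np.inf
    for n in candidates:
        gmm = GaussianMixture(n_components=n, covariance_type="full", random_state=0)
        gmm.fit(features)
        aic = gmm.aic(features)  # swap in gmm.bic(features) to minimize BIC instead
        if aic < best_aic:
            best_n, best_aic = n, aic
    return best_n
```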
The sweep described above suggested 24 components, but after experimenting I settled on 16.
```python
from sklearn.mixture import GaussianMixture

# Fit a 16-component GMM to the CLIP features of the "good" images
gmm = GaussianMixture(n_components=16, covariance_type="full")
gmm.fit(features)
```
After that, we can estimate the log-likelihood of each image:
```python
scores = gmm.score_samples(features)  # per-image log-likelihood
```
Histogram of the training dataset, x-axis – gmm score.
And here is a histogram of the dataset I filtered (/r/EarthPorn). Scores are truncated at 0 and −3000 for clarity.
Now we need to choose a threshold for deciding whether an image suits us. Through testing, I found that a score > 500 already indicates a good image.
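Filtering candidate images then reduces to a comparison against this threshold; a sketch, where `candidate_features` is a hypothetical array of CLIP embeddings for newly downloaded images:

```python
import numpy as np

THRESHOLD = 500  # empirical; the final run described below uses 700

scores = gmm.score_samples(candidate_features)  # log-likelihood per candidate image
good_indices = np.where(scores > THRESHOLD)[0]  # keep only sufficiently "normal" images
```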
Watermark detection
Unfortunately, the presence of watermarks does not affect the gmm score much, so I trained a binary classifier (watermark / no watermark) and labeled a dataset of 22k images. The dataset can be found here and here. Fun fact: I recently discovered that 3-5 people copied my dataset outright and posted it on Kaggle as their own, without mentioning where they got it. One of them was even in the top 18 of the global dataset leaderboard. Surprisingly, the report button really works, and after a couple of days the copied datasets were taken down.
CLIP
At first I thought it would be a good idea to use the same features we used for anomaly detection. Feeding these features into a simple 3-layer fully connected network gave an accuracy of 93-95 percent, and we could probably have stopped there. But I wanted the false negative rate to be as low as possible. I noticed that when an image is downscaled to 224×224, some watermarks disappear completely, so I decided to try the following (see the sketch after the list):
- Resize the image to 448×448
- Extract features from each of the four 224×224 quadrants of the image
- Concatenate them
- Feed the resulting vector into the fully connected network
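A minimal sketch of this quadrant scheme, assuming a CLIP-style model with an `encode_image` method and an already normalized 448×448 tensor:

```python
import torch

def quadrant_features(model, image_448: torch.Tensor) -> torch.Tensor:
    """image_448: a (3, 448, 448) tensor, already normalized for CLIP."""
    quads = [
        image_448[:, :224, :224], image_448[:, :224, 224:],
        image_448[:, 224:, :224], image_448[:, 224:, 224:],
    ]
    with torch.no_grad():
        feats = [model.encode_image(q.unsqueeze(0)) for q in quads]  # 4 x (1, 512)
    return torch.cat(feats, dim=1)  # (1, 2048) input for the classifier
```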
The new accuracy was 97-98%, and there were far fewer false negatives. Success!
Various augmentations were also used during training, such as rotations, blur, and JPEG compression artifacts. Crop-style augmentations were not used, because we do not know for sure in which part of the image the watermark may appear.
EfficientNetV2 and onnx
Since CLIP takes about 2 GB of VRAM on the GPU, and my server (a headless computer sitting on a cabinet) is used for more than just this project, I wanted to compute everything on the CPU while using less RAM.
With torch.onnx, I converted the visual part of the model (CLIP has two subnetworks: one handles text, the other images) to the onnx format, and then, to get rid of torch completely, rewrote the image normalization functions in numpy.
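A rough sketch of the export step, assuming the OpenAI CLIP implementation; the output file name and opset version are illustrative choices:

```python
import clip
import torch

model, _ = clip.load("ViT-B/16", device="cpu")
dummy = torch.randn(1, 3, 224, 224)  # example input for tracing

torch.onnx.export(
    model.visual,                # export only the image subnetwork
    dummy,
    "clip_visual.onnx",
    input_names=["image"],
    output_names=["features"],
    dynamic_axes={"image": {0: "batch"}, "features": {0: "batch"}},
    opset_version=14,            # assumed; any recent opset should work
)
```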
RAM usage after warm-up (visual + textual CLIP):
| Framework       | RAM (MB) |
|-----------------|----------|
| cpu pytorch     | 1194     |
| cpu onnxruntime | 748      |
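Inference then runs through onnxruntime alone; a sketch, assuming the file exported above and an image already normalized with numpy:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("clip_visual.onnx", providers=["CPUExecutionProvider"])

image = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in for a normalized image
(features,) = session.run(None, {"image": image})          # (1, 512) feature vector
```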
Okay, now less memory is needed, but what about performance? Computing features four times per image is wasteful; a separate model is needed.
Among the most efficient models in terms of parameters/FLOPs are those of the EfficientNet family.
In 2021 a sequel was released, EfficientNetV2; we will try the smallest variant, EfficientNetV2-B0, and take the implementation and weights from timm.
We train all layers and use images of size 512×512, so the network can see even smaller watermarks. The results are slightly better than with CLIP.
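A minimal sketch of building the classifier with timm; the two-class head is an assumption based on the watermark / no watermark task:

```python
import timm

# "tf_efficientnetv2_b0" is the timm name for EfficientNetV2-B0
model = timm.create_model("tf_efficientnetv2_b0", pretrained=True, num_classes=2)
```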
EfficientNetV2-B0 can be found in the GitHub repository and on Hugging Face.
anti_sus
[Github]
anti_sus is a zeromq server for image filtering. It accepts a batch of RGB images (a numpy array) as input and returns the indices of the good images. It has 2-stage filtering:
- outlier detection via the gmm score on CLIP features
- watermark detection with the EfficientNetV2 classifier
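A hypothetical client interaction might look like this; the address, serialization, and reply format here are assumptions for illustration, not the project's actual protocol:

```python
import numpy as np
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.connect("tcp://127.0.0.1:5555")  # assumed server address

batch = np.zeros((8, 512, 512, 3), dtype=np.uint8)  # a batch of 8 RGB images
sock.send_pyobj(batch)                # assumed pickle-based serialization
good = sock.recv_pyobj()              # indices of the images that passed both stages
```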
In the future, I would like to add models that perform Image Quality Assessment (IQA) and detect whether an image is synthetic, i.e. generated by a GAN or a diffusion model.
nomad
[Github]
nomad is a Reddit parser with various rules for extracting the best-quality images from Reddit, as well as from imgur and flickr links. It supports working with anti_sus and scenery (the photo gallery).
The results
About 154 images in ~14 hours, with threshold == 700 (taking into account various sleep() calls to reduce the probability of an IP ban).
Integration with photo gallery (scenery)
The nomad + anti_sus combination can be used in two ways: as a standalone tool that simply saves new images to the file system, or integrated with the gallery backend. In the latter case, new images are automatically added to our photo gallery, and we can use ambience to check whether an image is a duplicate.
Scenery.cx currently has 160k images, ~158k of which are filtered images from /r/EarthPorn.
nomad + anti_sus are running and automatically adding new images.
¹ A newer version of the image search article can be found here.