A Bloom filter is a probabilistic data structure for checking the membership of an element of a set
The target audience of the article:

technical workers,

technical managers,

students
Fundamental knowledge of data structures and algorithms is a prerequisite for understanding the article. The article is not intended to be a comprehensive guide to probabilistic data structures
Contents
 1 Introduction
 2 Requirements
 3 What is a Bloom filter?
 4 How does the Bloom filter work?
 5 How is an element added to a Bloom filter?
 6 How to check the membership of a Bloom filter element?
 7 Malfunction of the Bloom filter
 8 What is the asymptotic complexity of the Bloom filter?
 9 Bloom filter calculator
 10 Advantages of the Bloom filter
 11 Disadvantages of the Bloom filter
 12 Bloom filter usage scenarios
 13 Bloom filter implementations
 14 Bloom filter modifications
 15 Resume
Introduction
Data structures such as HashSet can be used for a small set of data, allowing you to check whether an element belongs to a set. At the same time, the use of checking whether an element belongs to a large data set can be costly. Temporal and spatial complexity can be linear in the worst case.
Probabilistic data structures provide constant temporal and spatial complexity by providing a nondeterministic response.
Requirements
Design a data structure with the following characteristics:

constant time complexity for checking the membership of a structure element,

a small amount of memory for checking the structure element’s membership,

insert and query operations must be parallelized,

Checking whether an element belongs to a structure can give approximate results.
What is a Bloom filter?
Bloom filter – It is a compact probabilistic data structure used to check whether an element belongs to a set. A Bloom filter reports that an element belongs to a set if the element belongs to the set, but may falsely report that the element belongs to the set even though the element is not part of the set (false positive error). The filter clearly reports the absence of an element in the set. Elements can be added to a Bloom filter, but cannot be removed from it.
The Bloom filter supports the following operations:

adding an element to a set,

checking whether an element belongs to a set.
How does the Bloom filter work?
A Bloom filter as a data structure is an array of bits of length. n as shown in Figure 1. Segment positions are represented by indices 0 to 9 for a bit array of length 10. All bits in a Bloom filter are zero when initialized, corresponding to an empty filter. A Bloom filter does not store the value of an element, but only stores a set of bits that identify it after computing hash functions over the element value.
How is an element added to a Bloom filter?
Operations performed when adding an element to a Bloom filter:

For the element is calculated k hash functions;

The resulting hashes are divided modulo by n (by the length of the bit array) to get k values identifying the bit positions of the Bloom filter array;

All bits in the identifying positions are set to 1.
There is a possibility that some bits in the array may be set to 1 multiple times due to hash function collisions.
.On Figure 2, the red and blue elements added to the Bloom filter are shown. The segments, which should be set to the value one for the red element, are determined by the results of dividing by the modulo of the calculated hash functions:

h1(red) mod 10 = 1,

h2(red) mod 10 = 3,

h3(red) mod 10 = 5.
The segments that must be set to one for the blue element are also determined by the results of dividing by the modulo of the calculated hash functions:

h1(blue) mod 10 = 4,

h2(blue) mod 10 = 5,

h3(blue) mod 10 = 9.
The segment in position five was set to one when adding both red and blue elements.
How to check the membership of a Bloom filter element?
Operations performed when checking the membership of a Bloom filter element:

Hash functions are calculated for the element;

The resulting hashes are divided modulo n (by the length of the bit array) to obtain values identifying the bit positions of the Bloom filter array;

A check is made that the received segments are set to one.
If any of the bits is zero, the element was not added to the Bloom filter. If all bits have a value of one, then the element maybe was added to the Bloom filter. Uncertainty regarding element belonging arises due to the possibility of setting some bits when adding different elements or collisions of hash functions.
on Figure 3, the membership check of the blue element of the Bloom filter is performed. The verified segments are obtained after division by the modulo of the values obtained from the hash functions calculated by the element.

h1(blue) mod 10 = 4,

h2(blue) mod 10 = 5,

h3(blue) mod 10 = 9.
The blue element may belong to a Bloom filter because all bits are set to one.
on Figure 4the black element that is not in the Bloom filter is checked:

h1(black) mod 10 = 0,

h2(black) mod 10 = 3,

h3(black) mod 10 = 6.
The black element does not belong to the Bloom filter because in segment zero the bit value is zero. Checking other bits can be skipped.
Malfunction of the Bloom filter
In Figure 5, the verification of the membership of the green element of the Bloom filter, which was not added to it, is performed:

h1(green) mod 10 = 3,

h2(green) mod 10 = 4,

h3(green) mod 10 = 5.
The Bloom filter will report that the element belongs even though the green element was not added to it and the corresponding bits were set when the red and blue elements were added.
What is the asymptotic complexity of the Bloom filter?
The performance of the Bloom filter depends on the hash functions used. The faster the hash function is calculated, the faster each operation on the Bloom filter works.
Temporary difficulty
Operation 
Temporary difficulty 
adding an item 
O(k) or constant 
checking element belonging 
O(k) or constant 
Where k number of hash functions.
The time complexity of the Bloom filter does not depend on the number of elements already added to the filter. All k checks are independent and can be performed in parallel.
Spatial complexity
Despite the number of elements in the Bloom filter, the filter requires a constant number of bits, using several bits for each element. The Bloom filter does not store the value of the added elements, which means it has constant complexity O(1).
Bloom filter calculator
Calculator for the Bloom filter on hur.st can be used to select the desired filter size and to study the effect of various parameters. The accuracy of the filter depends on the following parameters:

number of hash functions (k),

quality of hash functions,

length of bit array (n),

the number of elements added to the Bloom filter.
For a Bloom filter, hash functions must have the following properties:

speed,

independence,

uniform distribution.
Advantages of the Bloom filter
The advantages of the Bloom filter are:

constant time complexity;

constant spatial complexity;

operations can be performed in parallel;

there are no false negatives;

ensures confidentiality, because it does not store the value of the elements;

possibility of bitwise union or bitwise intersection of two Bloom filters of the same dimension using the same hash functions (translation: added from comments to the article).
Disadvantages of the Bloom filter
The disadvantages of the Bloom filter are as follows:

does not support the delete operation;

false positives cannot be eliminated;

the need for arbitrary access to array bits, the indicators of which are randomly generated by hash functions.
Removing an element from a Bloom filter is not supported because there is no way to identify which bits should be removed. There may be other elements in the Bloom filter that use the same bits, and clearing them may result in false negative errors.
Bloom filter usage scenarios
Bloom filters are useful in highload systems to prevent expensive disk operations. on Figure 6For example, the service is searching for an element in a large table on the disk, which can lead to degradation of the bandwidth of the service. The Bloom filter can preprocess searches and cut off unnecessary disk operations, except for false positives. The Bloom filter can be applied in the following cases:

reduction of disk accesses for keys that do not exist in the database;

determine whether a user ID has been received;

filtering previously displayed posts for recommendation engines;

checking words for spelling mistakes and profanity c spellcheckers;

identification of dangerous urls, blocked IP addresses and fraudulent transactions;

repositories using logstructured tree with merge (LSMtree), such as Cassandrause a Bloom filter to check the existence of a key in SSTable;;

MapReduce uses a Bloom filter to efficiently process large data sets.
Bloom filter implementations

Module RedisBloom supports the Bloom filter out of the box. The filter can occupy about 2% of the memory required for an arbitrary data set and will work faster than processing individual elements of the set as with a set.

An implementation of the Bloom filter in node.js for educational purposes is available in the repository github.com/guyroyse.

General purpose hash functions such as MurmurHash or FNV can be used in the Bloom filter.
Bloom filter modifications
Bloom filter with counting
A counted Bloom filter includes an array of counters for each bit. The array of counters is initialized with zeros. The counters for the associated bits are incremented by one when added to the Bloom filter with a count. Bloom filter element membership check works like in basic filter. A counted Bloom filter supports the operation of removing an element from the filter. Operations to remove an element from a Bloom filter with a count:

For the element is calculated k hash functions

The resulting hashes are divided modulo by n (by the length of the bit array) to get k values identifying the positions of the bits of the Bloom filter array

Counters of received bit positions are decremented by one

The corresponding bits are set to zero if the counter becomes zero after subtraction
A counted Bloom filter takes up more memory than a classical Bloom filter because it needs to store the counter values, even though most of the values will be zero. Therefore, it is important to estimate how large the values of the counters can be and how their size depends on the length of the filter n and number of hash functions k.
Scaled Bloom filter
A Bloom filter cannot be changed after it is saturated with elements, because it is impossible to determine which elements are added to the filter. At the same time, Bloom filters can be combined for scaling. Once the bloom filter is filled with elements, a new larger capacity filter can be added on top of the current filter. Querying the membership of a Bloom filter element requires a check in each filter. Additionally, when adding a new element, a check for the element’s presence in each filter below is required before adding to the toplevel filter
Striped Bloom filter
Striped Bloom filter is implemented using an array of unsigned 64bit integers and is sharded for parallelization. Each shard contains its own mutex, which provides increased throughput for additions and searches. The only limitation of the filter is that the size of the Bloom filter must be a multiple of the number of shards.
Other variations of the Bloom filter
There are other variations of the Bloom filter and data structures similar in functionality to the Bloom filter. Example (Trans.: Most examples added during translation):
Resume
Bloom filters are useful data structures in highly loaded systems. Bloom filters have constant temporal and spatial complexity. The tradeoff when using Bloom filters and its variations is a nondeterministic result.