A Bloom filter is a probabilistic data structure for checking the membership of an element of a set

A Bloom filter is a probabilistic data structure for checking the membership of an element of a set

The target audience of the article:

  • technical workers,

  • technical managers,

  • students

Fundamental knowledge of data structures and algorithms is a prerequisite for understanding the article. The article is not intended to be a comprehensive guide to probabilistic data structures

Introduction

Data structures such as HashSet can be used for a small set of data, allowing you to check whether an element belongs to a set. At the same time, the use of checking whether an element belongs to a large data set can be costly. Temporal and spatial complexity can be linear in the worst case.

Probabilistic data structures provide constant temporal and spatial complexity by providing a non-deterministic response.

Requirements

Design a data structure with the following characteristics:

  • constant time complexity for checking the membership of a structure element,

  • a small amount of memory for checking the structure element’s membership,

  • insert and query operations must be parallelized,

  • Checking whether an element belongs to a structure can give approximate results.

What is a Bloom filter?

Bloom filter – It is a compact probabilistic data structure used to check whether an element belongs to a set. A Bloom filter reports that an element belongs to a set if the element belongs to the set, but may falsely report that the element belongs to the set even though the element is not part of the set (false positive error). The filter clearly reports the absence of an element in the set. Elements can be added to a Bloom filter, but cannot be removed from it.

The Bloom filter supports the following operations:

  • adding an element to a set,

  • checking whether an element belongs to a set.

How does the Bloom filter work?

A Bloom filter as a data structure is an array of bits of length. n as shown in Figure 1. Segment positions are represented by indices 0 to 9 for a bit array of length 10. All bits in a Bloom filter are zero when initialized, corresponding to an empty filter. A Bloom filter does not store the value of an element, but only stores a set of bits that identify it after computing hash functions over the element value.

Figure 1: An empty Bloom filter.

How is an element added to a Bloom filter?

Operations performed when adding an element to a Bloom filter:

  1. For the element is calculated k hash functions;

  2. The resulting hashes are divided modulo by n (by the length of the bit array) to get k values ​​identifying the bit positions of the Bloom filter array;

  3. All bits in the identifying positions are set to 1.

There is a possibility that some bits in the array may be set to 1 multiple times due to hash function collisions.

Figure 2: Adding an element to a Bloom filter.

.On Figure 2, the red and blue elements added to the Bloom filter are shown. The segments, which should be set to the value one for the red element, are determined by the results of dividing by the modulo of the calculated hash functions:

  • h1(red) mod 10 = 1,

  • h2(red) mod 10 = 3,

  • h3(red) mod 10 = 5.

The segments that must be set to one for the blue element are also determined by the results of dividing by the modulo of the calculated hash functions:

  • h1(blue) mod 10 = 4,

  • h2(blue) mod 10 = 5,

  • h3(blue) mod 10 = 9.

The segment in position five was set to one when adding both red and blue elements.

How to check the membership of a Bloom filter element?

Operations performed when checking the membership of a Bloom filter element:

  1. Hash functions are calculated for the element;

  2. The resulting hashes are divided modulo n (by the length of the bit array) to obtain values ​​identifying the bit positions of the Bloom filter array;

  3. A check is made that the received segments are set to one.

Figure 3: Bloom filter element membership check.

If any of the bits is zero, the element was not added to the Bloom filter. If all bits have a value of one, then the element maybe was added to the Bloom filter. Uncertainty regarding element belonging arises due to the possibility of setting some bits when adding different elements or collisions of hash functions.

on Figure 3, the membership check of the blue element of the Bloom filter is performed. The verified segments are obtained after division by the modulo of the values ​​obtained from the hash functions calculated by the element.

  • h1(blue) mod 10 = 4,

  • h2(blue) mod 10 = 5,

  • h3(blue) mod 10 = 9.

The blue element may belong to a Bloom filter because all bits are set to one.

Figure 4: Verify that an element does not belong to a Bloom filter.

on Figure 4the black element that is not in the Bloom filter is checked:

  • h1(black) mod 10 = 0,

  • h2(black) mod 10 = 3,

  • h3(black) mod 10 = 6.

The black element does not belong to the Bloom filter because in segment zero the bit value is zero. Checking other bits can be skipped.

Malfunction of the Bloom filter

Figure 5. False triggering of the Bloom filter.

In Figure 5, the verification of the membership of the green element of the Bloom filter, which was not added to it, is performed:

  • h1(green) mod 10 = 3,

  • h2(green) mod 10 = 4,

  • h3(green) mod 10 = 5.

The Bloom filter will report that the element belongs even though the green element was not added to it and the corresponding bits were set when the red and blue elements were added.

What is the asymptotic complexity of the Bloom filter?

The performance of the Bloom filter depends on the hash functions used. The faster the hash function is calculated, the faster each operation on the Bloom filter works.

Temporary difficulty

Operation

Temporary difficulty

adding an item

O(k) or constant

checking element belonging

O(k) or constant

Where k number of hash functions.

The time complexity of the Bloom filter does not depend on the number of elements already added to the filter. All k checks are independent and can be performed in parallel.

Spatial complexity

Despite the number of elements in the Bloom filter, the filter requires a constant number of bits, using several bits for each element. The Bloom filter does not store the value of the added elements, which means it has constant complexity O(1).

Bloom filter calculator

Calculator for the Bloom filter on hur.st can be used to select the desired filter size and to study the effect of various parameters. The accuracy of the filter depends on the following parameters:

  • number of hash functions (k),

  • quality of hash functions,

  • length of bit array (n),

  • the number of elements added to the Bloom filter.

For a Bloom filter, hash functions must have the following properties:

  • speed,

  • independence,

  • uniform distribution.

Advantages of the Bloom filter

The advantages of the Bloom filter are:

  • constant time complexity;

  • constant spatial complexity;

  • operations can be performed in parallel;

  • there are no false negatives;

  • ensures confidentiality, because it does not store the value of the elements;

  • possibility of bitwise union or bitwise intersection of two Bloom filters of the same dimension using the same hash functions (translation: added from comments to the article).

Disadvantages of the Bloom filter

The disadvantages of the Bloom filter are as follows:

  • does not support the delete operation;

  • false positives cannot be eliminated;

  • the need for arbitrary access to array bits, the indicators of which are randomly generated by hash functions.

Removing an element from a Bloom filter is not supported because there is no way to identify which bits should be removed. There may be other elements in the Bloom filter that use the same bits, and clearing them may result in false negative errors.

Bloom filter usage scenarios

Figure 6: A Bloom filter is used to reduce the number of database queries.

Bloom filters are useful in high-load systems to prevent expensive disk operations. on Figure 6For example, the service is searching for an element in a large table on the disk, which can lead to degradation of the bandwidth of the service. The Bloom filter can preprocess searches and cut off unnecessary disk operations, except for false positives. The Bloom filter can be applied in the following cases:

  • reduction of disk accesses for keys that do not exist in the database;

  • determine whether a user ID has been received;

  • filtering previously displayed posts for recommendation engines;

  • checking words for spelling mistakes and profanity c spellcheckers;

  • identification of dangerous urls, blocked IP addresses and fraudulent transactions;

  • repositories using log-structured tree with merge (LSM-tree), such as Cassandrause a Bloom filter to check the existence of a key in SSTable;;

  • MapReduce uses a Bloom filter to efficiently process large data sets.

Bloom filter implementations

  • Module RedisBloom supports the Bloom filter out of the box. The filter can occupy about 2% of the memory required for an arbitrary data set and will work faster than processing individual elements of the set as with a set.

  • An implementation of the Bloom filter in node.js for educational purposes is available in the repository github.com/guyroyse.

  • General purpose hash functions such as MurmurHash or FNV can be used in the Bloom filter.

Bloom filter modifications

Bloom filter with counting

Figure 7: Bloom filter with counting.

A counted Bloom filter includes an array of counters for each bit. The array of counters is initialized with zeros. The counters for the associated bits are incremented by one when added to the Bloom filter with a count. Bloom filter element membership check works like in basic filter. A counted Bloom filter supports the operation of removing an element from the filter. Operations to remove an element from a Bloom filter with a count:

  1. For the element is calculated k hash functions

  2. The resulting hashes are divided modulo by n (by the length of the bit array) to get k values ​​identifying the positions of the bits of the Bloom filter array

  3. Counters of received bit positions are decremented by one

  4. The corresponding bits are set to zero if the counter becomes zero after subtraction

A counted Bloom filter takes up more memory than a classical Bloom filter because it needs to store the counter values, even though most of the values ​​will be zero. Therefore, it is important to estimate how large the values ​​of the counters can be and how their size depends on the length of the filter n and number of hash functions k.

Scaled Bloom filter

Figure 8: Scaled Bloom filter.

A Bloom filter cannot be changed after it is saturated with elements, because it is impossible to determine which elements are added to the filter. At the same time, Bloom filters can be combined for scaling. Once the bloom filter is filled with elements, a new larger capacity filter can be added on top of the current filter. Querying the membership of a Bloom filter element requires a check in each filter. Additionally, when adding a new element, a check for the element’s presence in each filter below is required before adding to the top-level filter

Striped Bloom filter

Striped Bloom filter is implemented using an array of unsigned 64-bit integers and is sharded for parallelization. Each shard contains its own mutex, which provides increased throughput for additions and searches. The only limitation of the filter is that the size of the Bloom filter must be a multiple of the number of shards.

Other variations of the Bloom filter

There are other variations of the Bloom filter and data structures similar in functionality to the Bloom filter. Example (Trans.: Most examples added during translation):

Resume

Bloom filters are useful data structures in highly loaded systems. Bloom filters have constant temporal and spatial complexity. The trade-off when using Bloom filters and its variations is a non-deterministic result.

Related posts