Algorithm - Bloom Filter

Subscribe Send me a message home page tags


Related Reading

Lecture Note: Bloom Filter (local copy)

Description

Bloom Filter is a data structure that supports the following operations:

It uses randomized algorithm to answer the query. If it says the queried element is not inserted, then it's always true. However, if it says the queried element is inserted, then there is a possibility that this answer is wrong. (e.g. false positive).

The main idea of Bloom Filter is when comparing two elements during the hash value collision resolution process, instead of comparing the two elements directly, we compare their "signature". The signature of an element is a set of hash values. In this way, we don't need to store the original elements anymore.

To implement a Bloom Filter, we need

Each hash function \( h_i \) maps an element \(x\) to a integer value in \( \{0, 1, ..., m - 1\} \).

To construct the signature of an element \(x\), we first calculate the k hash values. These hash values are used as index and we then set the bit to 1 on those indexes.

1
2
3
def insert(x):
    for i in range(k):
        bitarray[hashFunctions[k](x)] = 1

To answer a query on an element \(x\), we first calculate the \(k\) hash values. Again, these values are used as index. We will return "Yes" if the bits on these index are all 1. Otherwise, we return "No".

1
2
3
4
5
def isInserted(x):
    for i in range(k):
        if bitarray[hashFunctions[k](x)] != 1:
            return False
    return True

Implementation Details

In the previous section, we mentioned that we need \(k\) hash functions. In practice, we don't really need to manually create \(k\) different hash functions. What we can do is to split a hash function into \(k\) different "smaller" hash functions. For example, suppose we have a hash function that returns a 128 bits integer. We can divide the 128 bits into 4 equal parts and each part can be considered as a hash function that returns a 32 bits integer. In this ways, we split one hash function into 4 hash functions.

Analysis

The probability of getting a fals positive is

$$ (1 - e^{-kn/m})^k $$

where \(n\) is the number of unique elements inserted.

In order to have a false positive for a given element \(x\), the bit in the bit vector on all the \(k\) index specified by the hash functions should be one.

$$ \begin{eqnarray} P(\textrm{bitarray}[i] = 1) & = & 1 - P(\textrm{bitarray}[i] \neq 1) \\ & = & 1 - (1 - \frac{1}{m})^{kn} \end{eqnarray} $$

The exponent is \(kn\) because there are \(n\) unique inserted elements and for each element we calculate \(k\) hash values. The \( (1 - \frac{1}{m}) \) part is the probability of not setting bit at index \(i\) to 1 for each calculated hash value.

If we assume \(m\) is a relatively large number, then the expression can be approximated by \( (1 - e^{-kn/m})^k \).

Thoughts

The algorithm is simple and magical. What is really happening here and why could we save space? The save comes from the fact that we don't need to store the original elements anymore. Recall in a standard hash table implementation, in addition to the hash function we also need to provide a equal function. The equal function is needed because we need to resolve hash value collisions from time to time. It's the equal function that requires storing the elements. In Bloom Filter, the hash value collision is resolved not by comparing two elements directly but comparing the "signature" of the two elements. The signature of an element is a set of hash value which can be calculated at runtime.

----- END -----

Welcome to join reddit self-learning community.
Send me a message Subscribe to blog updates

Want some fun stuff?

/static/shopping_demo.png