Commit 43f9bcc7 authored by Martin Mareš: Bloom filters, 1-band and k-band version

is at most a~constant. This concludes the proof of the theorem.
TODO: Concentration inequalities and 5-independence.
\section{Bloom filters}
Bloom filters are a~family of data structures for approximate representation of sets
in a~small amount of memory. A~Bloom filter starts with an empty set. Then it supports
insertion of new elements and membership queries. Sometimes, the filter gives a~\em{false
positive} answer: it answers {\csc yes} even though the element is not in the set.
We will calculate the probability of false positives and decrease it at the expense of
making the structure slightly larger. False negatives will never occur.
\subsection{A trivial example}
We start with a~very simple filter. Let~$h$ be a~hash function from a~universe~$\cal U$
to $[m]$, picked at random from a~$c$-universal family. For simplicity, we will assume
that $c=1$. The output of the hash function will serve as an~index to an~array
$B[0\ldots m-1]$ of bits.
At the beginning, all bits of the array are zero.
When we insert an element~$x$, we simply set the bit $B[h(x)]$ to~1.
A~query for~$x$ tests the bit $B[h(x)]$ and answers {\csc yes} iff the bit is set to~1.
(We can imagine that we are hashing items to $m$~buckets, but we store only which
buckets are non-empty.)
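The trivial filter can be sketched in a few lines. The following Python sketch is purely illustrative (the class and method names are hypothetical, and a random multiply-mod-prime function stands in for a hash function drawn from a 1-universal family):

```python
import random

class OneBandFilter:
    """Illustrative sketch of the trivial 1-band filter.  The hash
    h(x) = ((a*x + b) mod p) mod m, with random a and b, stands in
    for a function picked from a 1-universal family."""

    def __init__(self, m):
        self.m = m
        self.bits = [0] * m                  # the bit array B[0..m-1]
        self.p = 2_147_483_647               # a prime larger than the universe
        self.a = random.randrange(1, self.p)
        self.b = random.randrange(self.p)

    def _h(self, x):
        return ((self.a * x + self.b) % self.p) % self.m

    def insert(self, x):
        self.bits[self._h(x)] = 1            # set the bit B[h(x)]

    def lookup(self, x):
        return self.bits[self._h(x)] == 1    # yes iff B[h(x)] is set
```

Since an inserted item always finds its own bit set, the sketch exhibits the no-false-negatives property directly.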
Suppose that we have already inserted items $x_1,\ldots,x_n$. If we query the filter
for any~$x_i$, it always answers {\csc yes.} But if we ask for a~$y$ different from
all~$x_i$'s, we can get a~false positive answer if $y$~falls into the same bucket
as one of the $x_i$'s.
Let us calculate the probability of a~false positive answer.
For a~concrete~$i$, we have $\Pr_h[h(y) = h(x_i)] \le 1/m$ by 1-universality.
By the union bound, the probability that $h(y) = h(x_i)$ for at least one~$i$
is at most $n/m$.
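The union-bound estimate can be checked empirically. This is a small, hedged experiment (hypothetical function name; a multiply-mod-prime hash as an illustrative stand-in for the universal family) that measures the false-positive rate of a 1-band filter:

```python
import random

def fp_rate(n, m, trials=5000):
    """Estimate the 1-band false-positive probability: insert
    x_1..x_n = 1..n, then query y = 0, which is outside the set.
    Each trial draws a fresh random multiply-mod-prime hash."""
    p = 2_147_483_647                        # a prime > |universe|
    hits = 0
    for _ in range(trials):
        a = random.randrange(1, p)
        b = random.randrange(p)
        h = lambda x: ((a * x + b) % p) % m
        occupied = {h(x) for x in range(1, n + 1)}   # buckets of x_1..x_n
        hits += h(0) in occupied                     # false positive?
    return hits / trials

# For n=100, m=1000 the estimate typically comes out close to
# 1 - (1 - 1/m)**n, roughly 0.095, consistent with the n/m = 0.1 bound.
```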
We can ask the inverse question, too: how large a~filter do we need to push the error
probability below some $\varepsilon>0$? By our calculation, $\lceil n/\varepsilon\rceil$
bits suffice. It is interesting that this size does not depend on the size of the universe
--- all previous data structures required at least $\log\vert{\cal U}\vert$ bits per item.
On the other hand, the size scales badly with error probability: for example,
a~filter for $10^6$ items with $\varepsilon = 0.01$ requires 100\thinspace Mb.
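As a quick sanity check of this arithmetic (the variable names below are illustrative):

```python
import math

# Size of a 1-band filter with false-positive probability <= eps:
# the union bound gives n/m <= eps, i.e. m >= n/eps bits.
n, eps = 10**6, 0.01
bits = math.ceil(n / eps)
# 10**8 bits = 100 megabits, matching the example in the text
```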
\subsection{Multi-band filters}
To achieve the same error probability in smaller space, we can simply run
multiple filters in parallel. We choose $k$~hash functions $h_1,\ldots,h_k$,
where $h_i$~maps the universe to a~separate array~$B_i$ of $m$~bits. Each
pair $(B_i,h_i)$ is called a~\em{band} of the filter.
Insertion adds the new item to all bands. A~query asks all bands and answers
{\csc yes} only if every band answered {\csc yes}.
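The $k$-band filter is a small extension of the 1-band structure. Again a hypothetical Python sketch (independent multiply-mod-prime hashes stand in for independently chosen universal hash functions):

```python
import random

class KBandFilter:
    """Illustrative sketch of the k-band filter: k independent
    hash functions h_1..h_k, each with its own m-bit array B_i."""

    def __init__(self, m, k):
        self.m = m
        self.p = 2_147_483_647                        # a prime > |universe|
        self.coeffs = [(random.randrange(1, self.p), random.randrange(self.p))
                       for _ in range(k)]             # one (a, b) per band
        self.bands = [[0] * m for _ in range(k)]      # arrays B_1..B_k

    def _h(self, i, x):
        a, b = self.coeffs[i]
        return ((a * x + b) % self.p) % self.m

    def insert(self, x):
        for i, band in enumerate(self.bands):         # add x to every band
            band[self._h(i, x)] = 1

    def lookup(self, x):
        # yes only if every band answers yes
        return all(band[self._h(i, x)] == 1
                   for i, band in enumerate(self.bands))
```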
We shall calculate the error probability of the $k$-band filter. Suppose that we set
$m=2n$, so that each band gives a~false positive with probability at most $1/2$.
The whole filter gives a~false positive only if all bands did, which happens with
probability at most $2^{-k}$ if the functions $h_1,\ldots,h_k$ were chosen independently.
This proves the following theorem.
\theorem{
Let $\varepsilon > 0$ be the desired error probability
and $n$~the maximum number of items in the set.
The $k$-band Bloom filter with $m=2n$ and $k = \lceil \log (1/\varepsilon)\rceil$
gives false positives with probability at most~$\varepsilon$.
It requires $\O(m\log(1/\varepsilon))$ bits of memory and both
\alg{Insert} and \alg{Lookup} run in time $\O(k)$.
}
In the example with $n=10^6$ and $\varepsilon=0.01$, we get $m=2\cdot 10^6$
and $k=7$, so the whole filter requires 14\thinspace Mb. If we decrease
$\varepsilon$ to $0.001$, we have to increase~$k$ only to~10, so the memory
consumption reaches only 20\thinspace Mb.
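The parameter choices $m=2n$ and $k=\lceil\log_2(1/\varepsilon)\rceil$ from the theorem can be recomputed mechanically; the helper below is a hypothetical illustration, not part of the notes:

```python
import math

def filter_parameters(n, eps):
    """Parameter choice from the theorem: m = 2n bits per band and
    k = ceil(log2(1/eps)) bands; total memory is m*k bits."""
    m = 2 * n
    k = math.ceil(math.log2(1 / eps))
    return m, k, m * k

# Recomputing the examples from the text:
# filter_parameters(10**6, 0.01)  -> (2_000_000, 7, 14_000_000)   i.e. 14 Mb
# filter_parameters(10**6, 0.001) -> (2_000_000, 10, 20_000_000)  i.e. 20 Mb
```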
\subsection{Optimizing parameters}
% The multi-band filter works well, but it turns out that we can fine-tune its parameters
% to obtain even better results (although only by a~constant factor). We can view it as
% an~optimization problem: given a~memory budget of~$M$ bits, set the parameters $m$ and~$k$
% such that the filter fits in memory ($mk \le M$) and the error probability is minimized.
%
% Let us focus on a~single band first and calculate probability of false positives.
% This time, we will assume that all hash functions are perfectly random.
% A~concrete~$x_i$ maps to a~bucket~$j$ with probability $1/m$,
% so the probability that the bit~$B[j]$ is zero after all $n$ items are inserted
% is $p := (1-(1/m))^n$. This can be approximated by $\e^{-n/m}$.
% For an item~$y$ outside the set, we get a~false positive only if $B[h(y)]=1$,
% which happens with probability approximately $1-p$.
TODO
\subsection{Merged filters}
TODO
\subsection{Counting filters}
TODO
\subsection{Representing functions: the Bloomier filters}
TODO
\endchapter