Commit 43f9bcc7 authored by Martin Mareš: Bloom filters, 1-band and k-band version

is at most a~constant. This concludes the proof of the theorem.
TODO: Concentration inequalities and 5-independence.
\section{Bloom filters}
Bloom filters are a~family of data structures for approximate representation of sets
in a~small amount of memory. A~Bloom filter starts with an empty set. Then it supports
insertion of new elements and membership queries. Sometimes, the filter gives a~\em{false
positive} answer: it answers {\csc yes} even though the element is not in the set.
We will calculate the probability of false positives and decrease it at the expense of
making the structure slightly larger. False negatives will never occur.
\subsection{A trivial example}
We start with a~very simple filter. Let~$h$ be a~hash function from a~universe~$\cal U$
to $[m]$, picked at random from a~$c$-universal family. For simplicity, we will assume
that $c=1$. The output of the hash function will serve as an~index to an~array
$B[0\ldots m-1]$ of bits.
At the beginning, all bits of the array are zero.
When we insert an element~$x$, we simply set the bit $B[h(x)]$ to~1.
A~query for~$x$ tests the bit $B[h(x)]$ and answers {\csc yes} iff the bit is set to~1.
(We can imagine that we are hashing items to $m$~buckets, but we store only which
buckets are non-empty.)
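The trivial filter can be sketched in a few lines. The following Python sketch is purely illustrative (the class and method names are hypothetical, and a random multiply-mod-prime function stands in for a hash function drawn from a 1-universal family):

```python
import random

class OneBandFilter:
    """Illustrative sketch of the trivial 1-band filter.  The hash
    h(x) = ((a*x + b) mod p) mod m, with random a and b, stands in
    for a function picked from a 1-universal family."""

    def __init__(self, m):
        self.m = m
        self.bits = [0] * m                  # the bit array B[0..m-1]
        self.p = 2_147_483_647               # a prime larger than the universe
        self.a = random.randrange(1, self.p)
        self.b = random.randrange(self.p)

    def _h(self, x):
        return ((self.a * x + self.b) % self.p) % self.m

    def insert(self, x):
        self.bits[self._h(x)] = 1            # set the bit B[h(x)]

    def lookup(self, x):
        return self.bits[self._h(x)] == 1    # yes iff B[h(x)] is set
```

Since an inserted item always finds its own bit set, the sketch exhibits the no-false-negatives property directly.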
Suppose that we have already inserted items $x_1,\ldots,x_n$. If we query the filter
for any~$x_i$, it always answers {\csc yes.} But if we ask for a~$y$ different from
all~$x_i$'s, we can get a~false positive answer if $y$~falls into the same bucket
as one of the $x_i$'s.
Let us calculate the probability of a~false positive answer.
For a~concrete~$i$, we have $\Pr_h[h(y) = h(x_i)] \le 1/m$ by 1-universality.
By the union bound, the probability that $h(y) = h(x_i)$ for at least one~$i$
is at most $n/m$.
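The union-bound estimate can be checked empirically. This is a small, hedged experiment (hypothetical function name; a multiply-mod-prime hash as an illustrative stand-in for the universal family) that measures the false-positive rate of a 1-band filter:

```python
import random

def fp_rate(n, m, trials=5000):
    """Estimate the 1-band false-positive probability: insert
    x_1..x_n = 1..n, then query y = 0, which is outside the set.
    Each trial draws a fresh random multiply-mod-prime hash."""
    p = 2_147_483_647                        # a prime > |universe|
    hits = 0
    for _ in range(trials):
        a = random.randrange(1, p)
        b = random.randrange(p)
        h = lambda x: ((a * x + b) % p) % m
        occupied = {h(x) for x in range(1, n + 1)}   # buckets of x_1..x_n
        hits += h(0) in occupied                     # false positive?
    return hits / trials

# For n=100, m=1000 the estimate typically comes out close to
# 1 - (1 - 1/m)**n, roughly 0.095, consistent with the n/m = 0.1 bound.
```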
We can ask the inverse question, too: how large a~filter do we need to push the error
probability below some $\varepsilon>0$? By our calculation, $\lceil n/\varepsilon\rceil$
bits suffice. It is interesting that this size does not depend on the size of the universe
--- all previous data structures required at least $\log\vert{\cal U}\vert$ bits per item.
On the other hand, the size scales badly with error probability: for example,
a~filter for $10^6$ items with $\varepsilon = 0.01$ requires 100\thinspace Mb.
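As a quick sanity check of this arithmetic (the variable names below are illustrative):

```python
import math

# Size of a 1-band filter with false-positive probability <= eps:
# the union bound gives n/m <= eps, i.e. m >= n/eps bits.
n, eps = 10**6, 0.01
bits = math.ceil(n / eps)
# 10**8 bits = 100 megabits, matching the example in the text
```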
\subsection{Multi-band filters}
To achieve the same error probability in smaller space, we can simply run
multiple filters in parallel. We choose $k$~hash functions $h_1,\ldots,h_k$,
where $h_i$~maps the universe to a~separate array~$B_i$ of $m$~bits. Each
pair $(B_i,h_i)$ is called a~\em{band} of the filter.
Insertion adds the new item to all bands. A~query asks all bands and answers
{\csc yes} only if every band answered {\csc yes}.
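The $k$-band filter is a small extension of the 1-band structure. Again a hypothetical Python sketch (independent multiply-mod-prime hashes stand in for independently chosen universal hash functions):

```python
import random

class KBandFilter:
    """Illustrative sketch of the k-band filter: k independent
    hash functions h_1..h_k, each with its own m-bit array B_i."""

    def __init__(self, m, k):
        self.m = m
        self.p = 2_147_483_647                        # a prime > |universe|
        self.coeffs = [(random.randrange(1, self.p), random.randrange(self.p))
                       for _ in range(k)]             # one (a, b) per band
        self.bands = [[0] * m for _ in range(k)]      # arrays B_1..B_k

    def _h(self, i, x):
        a, b = self.coeffs[i]
        return ((a * x + b) % self.p) % self.m

    def insert(self, x):
        for i, band in enumerate(self.bands):         # add x to every band
            band[self._h(i, x)] = 1

    def lookup(self, x):
        # yes only if every band answers yes
        return all(band[self._h(i, x)] == 1
                   for i, band in enumerate(self.bands))
```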
We shall calculate the error probability of the $k$-band filter. Suppose that we set
$m=2n$, so that each band gives a~false positive with probability at most $1/2$.
The whole filter gives a~false positive only if all bands did, which happens with
probability at most $2^{-k}$ if the functions $h_1,\ldots,h_k$ were chosen independently.
This proves the following theorem.
\theorem{
Let $\varepsilon > 0$ be the desired error probability
and $n$~the maximum number of items in the set.
The $k$-band Bloom filter with $m=2n$ and $k = \lceil \log (1/\varepsilon)\rceil$
gives false positives with probability at most~$\varepsilon$.
It requires $\O(m\log(1/\varepsilon))$ bits of memory and both
\alg{Insert} and \alg{Lookup} run in time $\O(k)$.
}
In the example with $n=10^6$ and $\varepsilon=0.01$, we get $m=2\cdot 10^6$
and $k=7$, so the whole filter requires 14\thinspace Mb. If we decrease
$\varepsilon$ to $0.001$, we have to increase~$k$ only to~10, so the memory
consumption reaches only 20\thinspace Mb.
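The parameter choices $m=2n$ and $k=\lceil\log_2(1/\varepsilon)\rceil$ from the theorem can be recomputed mechanically; the helper below is a hypothetical illustration, not part of the notes:

```python
import math

def filter_parameters(n, eps):
    """Parameter choice from the theorem: m = 2n bits per band and
    k = ceil(log2(1/eps)) bands; total memory is m*k bits."""
    m = 2 * n
    k = math.ceil(math.log2(1 / eps))
    return m, k, m * k

# Recomputing the examples from the text:
# filter_parameters(10**6, 0.01)  -> (2_000_000, 7, 14_000_000)   i.e. 14 Mb
# filter_parameters(10**6, 0.001) -> (2_000_000, 10, 20_000_000)  i.e. 20 Mb
```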
\subsection{Optimizing parameters}
% The multi-band filter works well, but it turns out that we can fine-tune its parameters
% to obtain even better results (although only by a~constant factor). We can view it as
% an~optimization problem: given a~memory budget of~$M$ bits, set the parameters $m$ and~$k$
% such that the filter fits in memory ($mk \le M$) and the error probability is minimized.
%
% Let us focus on a~single band first and calculate probability of false positives.
% This time, we will assume that all hash functions are perfectly random.
% A~concrete~$x_i$ maps to a~bucket~$j$ with probability $1/m$,
% so the probability that the bit~$B[j]$ is zero after all $n$ items are inserted
% is $p := (1-(1/m))^n$. This can be approximated by $\e^{-n/m}$.
% For an item~$y$ outside the set, we get a~false positive only if $B[h(y)]=1$,
% which happens with probability approximately $1-p$.
TODO
\subsection{Merged filters}
TODO
\subsection{Counting filters}
TODO
\subsection{Representing functions: the Bloomier filters}
TODO
\endchapter