Merge branch 'master' of gitlab.kam.mff.cuni.cz:mj/dsbook

07035907 · Martin Mareš · fe7ab7f1 · e7889d63 · 07035907
Commit 07035907 authored 6 years ago by Martin Mareš
--- a/06-hash/hash.tex
+++ b/06-hash/hash.tex
@@ -753,4 +753,122 @@ is at most a~constant. This concludes the proof of the theorem.
 TODO: Concentration inequalities and 5-independence.
+\section{Bloom filters}
+Bloom filters are a~family of data structures for approximate representation of sets
+in a~small amount of memory. A~Bloom filter starts with an empty set. Then it supports
+insertion of new elements and membership queries. Sometimes, the filter gives a~\em{false
+positive} answer: it answers {\csc yes} even though the element is not in the set.
+We will calculate the probability of false positives and decrease it at the expense of
+making the structure slightly larger. False negatives will never occur.
+\subsection{A trivial example}
+We start with a~very simple filter. Let~$h$ be a~hash function from a~universe~$\cal U$
+to $[m]$, picked at random from a~$c$-universal family. For simplicity, we will assume
+that $c=1$. The output of the hash function will serve as an~index to an~array
+$B[0\ldots m-1]$ of bits.
+At the beginning, all bits of the array are zero.
+When we insert an element~$x$, we simply set the bit $B[h(x)]$ to~1.
+A~query for~$x$ tests the bit $B[h(x)]$ and answers {\csc yes} iff the bit is set to~1.
+(We can imagine that we are hashing items to $m$~buckets, but we store only which
+buckets are non-empty.)
+Suppose that we have already inserted items $x_1,\ldots,x_n$. If we query the filter
+for any~$x_i$, it always answers {\csc yes.} But if we ask for a~$y$ different from
+all~$x_i$'s, we can get a~false positive answer if $y$~falls into the same bucket
+as one of the $x_i$'s.
+Let us calculate the probability of a~false positive answer.
+For a~concrete~$i$, we have $\Pr_h[h(y) = h(x_i)] \le 1/m$ by 1-universality.
+By the union bound, the probability that $h(y) = h(x_i)$ for at least one~$i$
+is at most $n/m$.
+We can ask an~inverse question, too: how large a~filter do we need to push the error
+probability under some $\varepsilon>0$? By our calculation, $\lceil n/\varepsilon\rceil$
+bits suffice. It is interesting that this size does not depend on the size of the universe
+--- all previous data structures required at least $\log\vert{\cal U}\vert$ bits per item.
+On the other hand, the size scales badly with error probability: for example,
+a~filter for $10^6$ items with $\varepsilon = 0.01$ requires 100\thinspace Mb.
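+
+To see how badly the size scales, we can repeat the calculation for a~ten times
+smaller error probability (the second value of~$\varepsilon$ is chosen here purely for illustration):
+$$
+	\left\lceil {10^6 \over 10^{-2}} \right\rceil = 10^8
+	\qquad\hbox{and}\qquad
+	\left\lceil {10^6 \over 10^{-3}} \right\rceil = 10^9.
+$$
+Each additional decimal digit of accuracy therefore multiplies the size of the filter by ten.
+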
+\subsection{Multi-band filters}
+To achieve the same error probability in smaller space, we can simply run
+multiple filters in parallel. We choose $k$~hash functions $h_1,\ldots,h_k$,
+where $h_i$~maps the universe to a~separate array~$B_i$ of $m$~bits. Each
+pair $(B_i,h_i)$ is called a~\em{band} of the filter.
+Insertion adds the new item to all bands. A~query asks all bands and it answers
+{\csc yes} only if each band answered {\csc yes}.
+We shall calculate the error probability of the $k$-band filter. Suppose that we set
+$m=2n$, so that each band gives a~false positive with probability at most $1/2$.
+The whole filter gives a~false positive only if all bands do, which happens with
+probability at most $2^{-k}$ if the functions $h_1,\ldots,h_k$ were chosen independently.
+This proves the following theorem.
+\theorem{
+Let $\varepsilon > 0$ be the desired error probability
+and $n$~the maximum number of items in the set.
+The $k$-band Bloom filter with $m=2n$ and $k = \lceil \log (1/\varepsilon)\rceil$
+gives false positives with probability at most~$\varepsilon$.
+It requires $\O(m\log(1/\varepsilon))$ bits of memory and both
+\alg{Insert} and \alg{Lookup} run in time $\O(k)$.
+}
+In the example with $n=10^6$ and $\varepsilon=0.01$, we get $m=2\cdot 10^6$
+and $k=7$, so the whole filter requires 14\thinspace Mb. If we decrease
+$\varepsilon$ to $0.001$, we have to increase~$k$ only to~10, so the memory
+consumption reaches only 20\thinspace Mb.
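+
+These numbers follow directly from the theorem if we read $\log$ as the binary logarithm
+(an~assumption consistent with the bound $2^{-k}$):
+$$
+	k = \lceil \log 100 \rceil = \lceil 6.64\ldots \rceil = 7,
+	\qquad
+	km = 7 \cdot 2\cdot 10^6 = 1.4\cdot 10^7,
+$$
+$$
+	k = \lceil \log 1000 \rceil = \lceil 9.96\ldots \rceil = 10,
+	\qquad
+	km = 10 \cdot 2\cdot 10^6 = 2\cdot 10^7.
+$$
+For comparison, the single-array filter needs $n/\varepsilon = 10^9$ bits for $\varepsilon = 0.001$.
+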
+\subsection{Optimizing parameters}
+The multi-band filter works well, but it turns out that we can fine-tune its parameters
+to obtain even better results (although only by a~constant factor). We can view it as
+an~optimization problem: given a~memory budget of~$M$ bits, set the parameters $m$ and~$k$
+such that the filter fits in memory ($mk \le M$) and the error probability is minimized.
+We will assume that all hash functions are perfectly random.
+Let us focus on a~single band first. If we select its size~$m$, we can easily
+calculate the probability that a~given bit is zero. We have $n$~items, each of them hashed
+to this bit with probability $1/m$. So the bit remains zero with probability $(1-1/m)^n$.
+This is approximately $p = \e^{-n/m}$.
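+
+The approximation can be justified as follows (a~sketch, assuming $m\ge 2$):
+$$
+	\left(1 - {1\over m}\right)^n = \e^{n\ln(1-1/m)},
+	\qquad
+	\ln\left(1 - {1\over m}\right) = -{1\over m} - {1\over 2m^2} - {1\over 3m^3} - \cdots,
+$$
+so the exponent equals $-n/m - \O(n/m^2)$. When $n$ and~$m$ are of the same order,
+the error term is $\O(1/m)$ and the two quantities differ by a~factor of $1 - \O(1/m)$ only.
+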
+We will show that if we set~$p$, all other parameters are uniquely determined and so
+is the probability of false positives. We will find~$p$ such that this probability is
+minimized.
+If we set~$p$, it follows that $m \approx -n / \ln p$. Since all bands must fit in $M$~bits
+of memory, we must have $k = \lfloor M/m\rfloor \approx -M/n \cdot \ln p$ bands. False
+positives occur if we find~1 in all bands, which has probability
+$$
+	(1-p)^k =
+	\e^{k\ln(1-p)} \approx
+	\e^{-M/n \cdot \ln p \cdot \ln(1-p)}.
+$$
+As $\e^x$ is an increasing function and the exponent carries a~negative sign, it suffices to maximize $\ln p \cdot \ln (1-p)$
+for $p\in(0,1)$. By elementary calculus, the maximum is attained for $p = 1/2$. This
+leads to false positive probability $(1/2)^k = 2^{-k}$. If we want to push this under~$\varepsilon$,
+we set $k = \lceil\log(1/\varepsilon)\rceil$,
+so $M = kn / \ln 2 \approx n \cdot \log(1/\varepsilon) \cdot (1/\ln 2) \doteq
+n \cdot \log(1/\varepsilon) \cdot 1.44$.
+This improves the constant~2 from the previous construction to approximately 1.44
+(TODO).
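+
+The calculus step can be spelled out as follows. Write $f(p) = \ln p \cdot \ln(1-p)$.
+Both logarithms are negative on $(0,1)$, so $f(p) > 0$ there, and the bound
+$\e^{-M/n\cdot f(p)}$ is smallest when $f(p)$ is largest. We have
+$$
+	f'(p) = {\ln(1-p) \over p} - {\ln p \over 1-p}
+	      = {(1-p)\ln(1-p) - p\ln p \over p(1-p)}.
+$$
+The numerator vanishes for $p=1/2$. It is positive on $(0,1/2)$: it tends to~0 for $p\to 0$,
+equals~0 at $p=1/2$, and its second derivative $1/(1-p) - 1/p$ is negative there,
+so it is strictly concave. By the symmetry $f(p) = f(1-p)$, the function~$f$ increases
+on $(0,1/2)$ and decreases on $(1/2,1)$, so its maximum is $f(1/2) = (\ln 2)^2 \doteq 0.48$.
+
+For the running example with $n=10^6$ and $\varepsilon=0.01$, this gives $k=7$ and
+$M = 7\cdot 10^6 / \ln 2 \doteq 1.01\cdot 10^7$ bits, compared with $1.4\cdot 10^7$ bits for $m=2n$.
+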
+TODO: Lower bound.
+\subsection{Merged filters}
+TODO
+\subsection{Counting filters}
+TODO
+\subsection{Representing functions: the Bloomier filters}
+TODO
 \endchapter