Bloom filters: more variants

Commit 319acc0d authored 6 years ago by Martin Mareš
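
The single-table filter described in the first hunk below can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not code from the repository: the class name `SingleTableBloomFilter` is hypothetical, salted BLAKE2b digests stand in for the perfectly random, mutually independent hash functions the text assumes, and the table uses one byte per bit for simplicity. The parameters follow the formulas in the text, with the logarithm taken to base 2.

```python
import hashlib
import math


class SingleTableBloomFilter:
    """Sketch: k hash functions pointing into one shared bit table."""

    def __init__(self, n, eps):
        # Parameters from the analysis: k = ceil(log2(1/eps)), m = k*n / ln 2.
        self.k = math.ceil(math.log2(1 / eps))
        self.m = math.ceil(self.k * n / math.log(2))
        self.bits = bytearray(self.m)  # one byte per bit, for simplicity

    def _positions(self, x):
        # Assumption: salted BLAKE2b approximates k independent random functions.
        for i in range(self.k):
            digest = hashlib.blake2b(
                repr(x).encode(),
                digest_size=8,
                salt=i.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.m

    def insert(self, x):
        # Set the bits B[h_1(x)], ..., B[h_k(x)].
        for pos in self._positions(x):
            self.bits[pos] = 1

    def lookup(self, x):
        # Answer "yes" only if all k bits are set; false positives are possible.
        return all(self.bits[pos] for pos in self._positions(x))
```

For example, `SingleTableBloomFilter(1000, 0.01)` picks 7 hash functions and a table of 10099 bits, which matches the estimate of about 1.44 times n times k.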
--- a/06-hash/hash.tex
+++ b/06-hash/hash.tex
@@ -865,14 +865,79 @@ optimum.}

 \subsection{Single-table filters}

-TODO
+It is also possible to construct a~Bloom filter in which multiple hash functions
+point to bits in a~shared table. (In fact, this was the original construction by Bloom.)
+Consider $k$~hash functions $h_1,\ldots,h_k$ mapping the universe to~$[m]$ and a~bit array
+$B[0,\ldots,m-1]$. $\alg{Insert}(x)$ sets the bits $B[h_1(x)],\ldots,B[h_k(x)]$ to~1.
+$\alg{Lookup}(x)$ returns {\csc yes} if all these bits are set.
+
+This filter can be analysed similarly to the $k$-band version. We will assume that
+all hash functions are perfectly random and mutually independent.
+
+Insertion of $n$~elements sets $kn$ bits (not necessarily distinct), so the
+probability that a~fixed bit $B[i]$ remains zero is $(1-1/m)^{nk}$, which is approximately
+$p = \e^{-nk/m}$. We will find the value of~$p$ which minimizes the probability
+of false positives. For fixed $m$ and~$n$, we get $k = -m/n\cdot\ln p$.
+
+We get a~false positive if all bits $B[h_i(x)]$ are set. This happens with probability
+approximately\foot{We are cheating a~little bit here: the events $B[i]=1$
+for different~$i$ are not mutually independent. However, further analysis shows that
+they are only weakly correlated, so our approximation holds.}
+$(1-p)^k = (1-p)^{-m/n\cdot\ln p} = \exp(-m/n\cdot\ln p\cdot\ln (1-p))$.
+Again, this is minimized for $p = 1/2$. So for a~fixed error probability~$\varepsilon$,
+we get $k = \lceil\log(1/\varepsilon)\rceil$ and $m = kn / \ln 2 \doteq 1.44\cdot
+n\cdot\lceil\log(1/\varepsilon)\rceil$.
+
+We see that as far as our approximation can tell, single-table Bloom filters
+achieve the same performance as the $k$-band version.
+
+% TODO
+% \subsection{Set operations}

 \subsection{Counting filters}

-TODO
+An~ordinary Bloom filter does not support deletion: when we delete an~item, we do not
+know if some of its bits are shared with other items. There is an~easy solution: instead
+of bits, keep $b$-bit counters $C[0,\ldots,m-1]$. \alg{Insert} increments the counters, \alg{Delete} decrements
+them, and \alg{Lookup} returns {\csc yes} if all counters are non-zero.

-\subsection{Representing functions: the Bloomier filters}
+However, since the counters have limited range, they can overflow. We will handle overflows
+by keeping the counter at the maximum allowed value $2^b-1$, which is then changed by
+neither subsequent insertions nor deletions. We say that the counter is \em{stuck.} Obviously,
+too many stuck counters will degrade the data structure. We will show that this happens
+with small probability only.

-TODO
+We will analyse a~single-band filter with one fully random hash function and $m$~counters after
+insertion of~$n$ items. For a~fixed counter value~$t$, we have
+$$
+	\Pr[C[i]=t] = {n\choose t}\cdot \left(1\over m\right)^t \cdot \left(1 - {1\over m}\right)^{n-t},
+$$
+because for each of the ${n\choose t}$ $t$-tuples of items we have probability $(1/m)^t$ that the
+tuple is hashed to~$i$ and probability $(1-1/m)^{n-t}$ that all other items are
+hashed elsewhere.
+If $C[i]\ge t$, there must exist a~$t$-tuple hashed to~$i$ and the remaining items
+can be hashed anywhere. Therefore:
+$$
+	\Pr[C[i]\ge t] \le {n\choose t}\cdot \left(1\over m\right)^t.
+$$
+Since ${n\choose t} \le (n\e/t)^t$, we have
+$$
+	\Pr[C[i]\ge t] \le \left( n\e \over t \right)^t \cdot \left( 1\over m \right)^t
+	= \left( n\e \over mt \right)^t.
+$$
+As we already know that the optimum~$m$ for a~single band is approximately $n / \ln 2$, the probability is
+at most $(\e\ln 2 / t)^t$.
+By the union bound, the probability that there exists a~stuck counter is at most $m$~times larger.
+
+\example{
+A~4-bit counter is stuck when it reaches $t=15$, which by our bound happens with probability at most $3.06\cdot 10^{-14}$.
+If we have $m = 10^9$ counters, the probability that any is stuck is at most $3.06\cdot 10^{-5}$.
+So for any reasonably large table, 4-bit counters are sufficient and they seldom get stuck.
+Of course, for a~very long sequence of operations, stuck counters eventually accumulate,
+so we should preferably rebuild the structure occasionally.
+}
+
+% TODO
+% \subsection{Representing functions: the Bloomier filters}

 \endchapter
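
The counting-filter bound from the hunk above can be checked numerically with a short Python sketch. It is an illustration under assumptions rather than part of the commit: the names `stuck_bound` and `SaturatingCounters` are hypothetical, the bound is the one derived in the text for a single band with m approximately n / ln 2, and hashing is left to the caller, who passes counter indices directly.

```python
import math


def stuck_bound(b, m=None):
    """Bound (e * ln 2 / t)^t for t = 2^b - 1; times m if m is given (union bound)."""
    t = 2 ** b - 1
    per_counter = (math.e * math.log(2) / t) ** t
    return per_counter if m is None else m * per_counter


class SaturatingCounters:
    """Sketch of m b-bit counters which become stuck at 2^b - 1."""

    def __init__(self, m, b=4):
        self.max = 2 ** b - 1
        self.c = [0] * m

    def increment(self, i):
        # A counter that reaches self.max stays there forever.
        if self.c[i] < self.max:
            self.c[i] += 1

    def decrement(self, i):
        # Stuck counters are never changed; zero counters are left alone.
        if 0 < self.c[i] < self.max:
            self.c[i] -= 1

    def nonzero(self, i):
        return self.c[i] > 0


print(stuck_bound(4))           # bound for one 4-bit counter
print(stuck_bound(4, 10 ** 9))  # union bound over 10^9 counters
```

Under these assumptions, the two printed values are roughly 3.06e-14 and 3.06e-05, in agreement with the example in the text.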
--- a/tex/adsmac.tex
+++ b/tex/adsmac.tex
@@ -214,7 +214,9 @@
 % Poznamky pod carou
 \newcount\footcnt
 \footcnt=0
-\def\foot#1{\global\advance\footcnt by 1\footmark{\the\footcnt}%
+\def\foot#1{%
+	\nobreak\hskip 0pt  % Allow hyphenation of the preceding word
+	\global\advance\footcnt by 1\footmark{\the\footcnt}%
 	\insert\footins{
 		\interlinepenalty=\interfootnotelinepenalty
 		\splittopskip=\ht\strutbox