Commit 319acc0d authored by Martin Mareš

Bloom filters: more variants

parent ec53dced
@@ -865,14 +865,79 @@ optimum.}
\subsection{Single-table filters}
It is also possible to construct a~Bloom filter in which multiple hash functions
point to bits in a~shared table. (In fact, this was the original construction by Bloom.)
Consider $k$~hash functions $h_1,\ldots,h_k$ mapping the universe to~$[m]$ and a~bit array
$B[0,\ldots,m-1]$. $\alg{Insert}(x)$ sets the bits $B[h_1(x)],\ldots,B[h_k(x)]$ to~1.
$\alg{Lookup}(x)$ returns {\csc yes} if all these bits are set.
This filter can be analysed similarly to the $k$-band version. We will assume that
all hash functions are perfectly random and mutually independent.
Insertion of $n$~elements sets $kn$ bits (not necessarily distinct), so the
probability that a~fixed bit $B[i]$ remains unset is $(1-1/m)^{nk}$, which is approximately
$p = \e^{-nk/m}$. We will find the optimum value of~$p$, for which the probability
of false positives is minimized. For fixed~$m$, we get $k = -m/n\cdot\ln p$.
We get a~false positive if all bits $B[h_i(x)]$ are set. This happens with probability
approximately\foot{We are cheating a~little bit here: the events $B[i]=1$
for different~$i$ are not mutually independent. However, further analysis shows that
they are only weakly correlated, so our approximation holds.}
$(1-p)^k = (1-p)^{-m/n\cdot\ln p} = \exp(-m/n\cdot\ln p\cdot\ln (1-p))$.
Again, this is minimized for $p = 1/2$. So for a~fixed error probability~$\varepsilon$,
we get $k = \lceil\log(1/\varepsilon)\rceil$ and $m = kn / \ln 2 \doteq 1.44\cdot
n\cdot\lceil\log(1/\varepsilon)\rceil$.
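To see why $p=1/2$ is optimal, here is a~short sketch (the auxiliary function~$f$ is our notation).
Minimizing the false-positive estimate means maximizing
$$
f(p) = \ln p\cdot\ln (1-p) \quad\hbox{for $0<p<1$.}
$$
The function is symmetric under $p\leftrightarrow 1-p$ and its derivative
$$
f'(p) = {\ln (1-p) \over p} - {\ln p \over 1-p}
$$
vanishes at $p=1/2$, where $f(1/2) = (\ln 2)^2$ is the maximum.
Substituting $p=1/2$ into $k = -m/n\cdot\ln p$ gives $k = m/n\cdot\ln 2$,
so the estimate becomes $(1-p)^k = 2^{-k}$, and requiring $2^{-k}\le\varepsilon$
yields the value of~$k$ stated above.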
We see that as far as our approximation can tell, single-table Bloom filters
achieve the same performance as the $k$-band version.
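\example{
As a~rough illustration (the target $\varepsilon = 0.01$ is our own choice, and we read $\log$ as the binary logarithm):
we need $k = \lceil\log 100\rceil = \lceil 6.64\ldots\rceil = 7$ hash functions
and $m = 7n/\ln 2 \doteq 10.1\cdot n$ bits, that is, roughly 10~bits per stored element.
With $p = 1/2$, the estimated probability of a~false positive is $2^{-7} \doteq 0.0078$,
slightly below the target because $k$~was rounded up.
}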
% TODO
% \subsection{Set operations}
\subsection{Counting filters}
An~ordinary Bloom filter does not support deletion: when we delete an~item, we do not
know if some of its bits are shared with other items. There is an~easy solution: instead
of bits, keep $b$-bit counters $C[0,\ldots,m-1]$. \alg{Insert} increments the counters, \alg{Delete} decrements
them, and \alg{Lookup} returns {\csc yes} if all counters are non-zero.
However, since the counters have limited range, they can overflow. We will handle overflows
by keeping the counter at the maximum allowed value $2^b-1$, which will not be changed by
subsequent insertions or deletions. We say that the counter is \em{stuck.} Obviously,
too many stuck counters will degrade the data structure. We will show that this happens
with small probability only.
We will consider a~single-band filter with one fully random hash function and $m$~counters,
in the state after insertion of~$n$ items. For a~fixed counter value~$t$, we have
$$
\Pr[C[i]=t] = {n\choose t}\cdot \left(1\over m\right)^t \cdot \left(1 - {1\over m}\right)^{n-t},
$$
because for each of $n\choose t$ $t$-tuples we have probability $(1/m)^t$ that the
tuple is hashed to~$i$ and probability $(1-1/m)^{n-t}$ that all other items are
hashed elsewhere.
If $C[i]\ge t$, there must exist a~$t$-tuple hashed to~$i$ and the remaining items
can be hashed anywhere. Therefore:
$$
\Pr[C[i]\ge t] \le {n\choose t}\cdot \left(1\over m\right)^t.
$$
Since ${n\choose t} \le (n\e/t)^t$, we have
$$
\Pr[C[i]\ge t] \le \left( n\e \over t \right)^t \cdot \left( 1\over m \right)^t
= \left( n\e \over mt \right)^t.
$$
As we already know that the optimum~$m$ is approximately $n / \ln 2$, the probability is
at most $(\e\ln 2 / t)^t$.
By the union bound, the probability that there exists a~stuck counter is at most $m$~times larger.
\example{
A~4-bit counter is stuck when it reaches $t=15$, which by our bound happens with probability at most $3.06\cdot 10^{-14}$.
If we have $m = 10^9$ counters, the probability that any is stuck is at most $3.06\cdot 10^{-5}$.
So for any reasonably large table, 4-bit counters are sufficient and they seldom get stuck.
Of course, for a~very long sequence of operations, stuck counters eventually accumulate,
so we should preferably rebuild the structure occasionally.
}
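\example{
For comparison, a~sketch with 3-bit counters (a~case we add for illustration, using the same
bound and the same assumption $m \doteq n/\ln 2$): such a~counter is stuck when it reaches $t=7$,
and the bound gives $(\e\ln 2/7)^7 \doteq 1.02\cdot 10^{-4}$.
For $m = 10^9$ counters, the union bound then exceeds~1 and tells us nothing,
so this estimate cannot justify 3-bit counters for a~table of that size.
}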
% TODO
% \subsection{Representing functions: the Bloomier filters}
\endchapter
@@ -214,7 +214,9 @@
% Poznamky pod carou
\newcount\footcnt
\footcnt=0
\def\foot#1{%
\nobreak\hskip 0pt % Allow hyphenation of the preceding word
\global\advance\footcnt by 1\footmark{\the\footcnt}%
\insert\footins{
\interlinepenalty=\interfootnotelinepenalty
\splittopskip=\ht\strutbox