Commit 07035907 authored by Martin Mareš

Merge branch 'master' of gitlab.kam.mff.cuni.cz:mj/dsbook

parents fe7ab7f1 e7889d63
@@ -753,4 +753,122 @@ is at most a~constant. This concludes the proof of the theorem.
TODO: Concentration inequalities and 5-independence.
\section{Bloom filters}
Bloom filters are a~family of data structures for approximate representation of sets
in a~small amount of memory. A~Bloom filter starts with an empty set. Then it supports
insertion of new elements and membership queries. Sometimes, the filter gives a~\em{false
positive} answer: it answers {\csc yes} even though the element is not in the set.
We will calculate the probability of false positives and decrease it at the expense of
making the structure slightly larger. False negatives will never occur.
\subsection{A trivial example}
We start with a~very simple filter. Let~$h$ be a~hash function from a~universe~$\cal U$
to $[m]$, picked at random from a~$c$-universal family. For simplicity, we will assume
that $c=1$. The output of the hash function will serve as an~index to an~array
$B[0\ldots m-1]$ of bits.
At the beginning, all bits of the array are zero.
When we insert an element~$x$, we simply set the bit $B[h(x)]$ to~1.
A~query for~$x$ tests the bit $B[h(x)]$ and answers {\csc yes} iff the bit is set to~1.
(We can imagine that we are hashing items to $m$~buckets, but we store only which
buckets are non-empty.)
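
The single-array filter described above can be sketched in a~few lines of Python. This is only an illustrative sketch: the class name \alg{TrivialFilter}, the restriction to integer keys below the prime~$P$, and the concrete family $h(x) = ((ax+b) \bmod P) \bmod m$ are assumptions made for the example, not part of the text.

```python
import random

# Assumption: keys are integers in [0, P), where P is the Mersenne prime 2^61 - 1.
P = (1 << 61) - 1


class TrivialFilter:
    """One hash function and one array of m bits (hypothetical sketch)."""

    def __init__(self, m, rng=random):
        self.m = m
        self.bits = bytearray(m)  # one byte per bit, chosen for readability
        # h(x) = ((a*x + b) mod P) mod m, with a and b drawn at random
        self.a = rng.randrange(1, P)
        self.b = rng.randrange(0, P)

    def _h(self, x):
        return ((self.a * x + self.b) % P) % self.m

    def insert(self, x):
        # Set the bit B[h(x)] to 1.
        self.bits[self._h(x)] = 1

    def lookup(self, x):
        # Answer yes iff the bit B[h(x)] is set.
        return self.bits[self._h(x)] == 1
```

Every inserted item is reported as present, because its bit is never cleared; a~real implementation would pack the bits instead of spending a~whole byte on each.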
Suppose that we have already inserted items $x_1,\ldots,x_n$. If we query the filter
for any~$x_i$, it always answers {\csc yes}. But if we ask for a~$y$ different from
all~$x_i$'s, we can get a~false positive answer if $y$~falls to the same bucket
as one of the $x_i$'s.
Let us calculate the probability of a~false positive answer.
For a~concrete~$i$, we have $\Pr_h[h(y) = h(x_i)] \le 1/m$ by 1-universality.
By the union bound, the probability that $h(y) = h(x_i)$ for at least one~$i$
is at most $n/m$.
We can ask an~inverse question, too: how large a~filter do we need to push the error
probability under some $\varepsilon>0$? By our calculation, $\lceil n/\varepsilon\rceil$
bits suffice. It is interesting that this size does not depend on the size of the universe
--- all previous data structures required at least $\log\vert{\cal U}\vert$ bits per item.
On the other hand, the size scales badly with error probability: for example,
a~filter for $10^6$ items with $\varepsilon = 0.01$ requires $10^8$~bits, i.e., 100\thinspace Mb.
\subsection{Multi-band filters}
To achieve the same error probability in smaller space, we can simply run
multiple filters in parallel. We choose $k$~hash functions $h_1,\ldots,h_k$,
where $h_i$~maps the universe to a~separate array~$B_i$ of $m$~bits. Each
pair $(B_i,h_i)$ is called a~\em{band} of the filter.
Insertion adds the new item to all bands. A~query asks all bands and it answers
{\csc yes} only if each band answered {\csc yes}.
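
A~$k$-band filter can be sketched along the same lines. Again, this is a~minimal illustration under assumptions of our own: integer keys below the prime~$P$, the hypothetical class name \alg{MultiBandFilter}, and one independently drawn function $h_i(x) = ((a_i x + b_i) \bmod P) \bmod m$ per band.

```python
import random

# Assumption: keys are integers in [0, P), where P is the Mersenne prime 2^61 - 1.
P = (1 << 61) - 1


class MultiBandFilter:
    """k bands, each a pair (bit array B_i, hash function h_i) (hypothetical sketch)."""

    def __init__(self, m, k, rng=random):
        self.m = m
        # Each band stores its own array and its own independently drawn (a_i, b_i).
        self.bands = [
            (bytearray(m), rng.randrange(1, P), rng.randrange(0, P))
            for _ in range(k)
        ]

    def _index(self, a, b, x):
        return ((a * x + b) % P) % self.m

    def insert(self, x):
        # Insertion adds the item to all bands.
        for bits, a, b in self.bands:
            bits[self._index(a, b, x)] = 1

    def lookup(self, x):
        # Answer yes only if every band answers yes.
        return all(bits[self._index(a, b, x)] for bits, a, b in self.bands)
```

Both operations touch one bit per band, which matches the $\O(k)$ running time.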
We shall calculate the error probability of the $k$-band filter. Suppose that we set
$m=2n$, so that each band gives a~false positive with probability at most $1/2$.
The whole filter gives a~false positive only if all bands did, which happens with
probability at most $2^{-k}$ if the functions $h_1,\ldots,h_k$ were chosen independently.
This proves the following theorem.
\theorem{
Let $\varepsilon > 0$ be the desired error probability
and $n$~the maximum number of items in the set.
The $k$-band Bloom filter with $m=2n$ and $k = \lceil \log (1/\varepsilon)\rceil$
gives false positives with probability at most~$\varepsilon$.
It requires $\O(m\log(1/\varepsilon))$ bits of memory and both
\alg{Insert} and \alg{Lookup} run in time $\O(k)$.
}
In the example with $n=10^6$ and $\varepsilon=0.01$, we get $m=2\cdot 10^6$
and $k=7$, so the whole filter requires 14\thinspace Mb. If we decrease
$\varepsilon$ to $0.001$, we have to increase~$k$ only to~10, so the memory
consumption reaches only 20\thinspace Mb.
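
The arithmetic of both examples can be reproduced with a~short script. It is a~sketch only: the function names are hypothetical, and $\varepsilon$ is passed as a~decimal string so that it can be handled as an exact fraction.

```python
from fractions import Fraction
import math


def trivial_bits(n, eps):
    """Size of the single-array filter: ceil(n / eps) bits."""
    return math.ceil(n / Fraction(eps))


def multiband_params(n, eps):
    """Return (m, k, total bits) for the k-band filter with m = 2n."""
    eps = Fraction(eps)
    k = 0
    while Fraction(1, 2 ** k) > eps:  # smallest k with 2^-k <= eps
        k += 1
    m = 2 * n
    return m, k, m * k


print(trivial_bits(10**6, "0.01"))       # single array
print(multiband_params(10**6, "0.01"))   # k-band filter, eps = 0.01
print(multiband_params(10**6, "0.001"))  # k-band filter, eps = 0.001
```

The loop computes $k = \lceil\log(1/\varepsilon)\rceil$ without floating-point rounding, which matters when $1/\varepsilon$ is an exact power of two.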
\subsection{Optimizing parameters}
The multi-band filter works well, but it turns out that we can fine-tune its parameters
to obtain even better results (although only by a~constant factor). We can view it as
an~optimization problem: given a~memory budget of~$M$ bits, set the parameters $m$ and~$k$
such that the filter fits in memory ($mk \le M$) and the error probability is minimized.
We will assume that all hash functions are perfectly random.
Let us focus on a~single band first. If we select its size~$m$, we can easily
calculate the probability that a~given bit is zero. We have $n$~items, each of them hashed
to this bit with probability $1/m$. So the bit remains zero with probability $(1-1/m)^n$.
This is approximately $p = \e^{-n/m}$.
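
The quality of this approximation can be seen by expanding the logarithm. The following short calculation is a~sketch added for illustration; it assumes $m\ge 2$ and uses $\ln(1-x) = -x - x^2/2 - \ldots$ with $x = 1/m$:
$$
(1-1/m)^n = \e^{n\ln(1-1/m)} = \e^{-n/m - n/(2m^2) - \ldots}.
$$
Replacing this by $\e^{-n/m}$ therefore changes the value by a~relative error of order $n/m^2$, which is negligible whenever $m$~is proportional to~$n$.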
We will show that if we set~$p$, all other parameters are uniquely determined and so
is the probability of false positives. We will find~$p$ such that this probability is
minimized.
If we set~$p$, it follows that $m \approx -n / \ln p$. Since all bands must fit in $M$~bits
of memory, we must have $k = \lfloor M/m\rfloor \approx -M/n \cdot \ln p$ bands. False
positives occur if we find~1 in all bands, which has probability
$$
(1-p)^k =
\e^{k\ln(1-p)} \approx
\e^{-M/n \cdot \ln p \cdot \ln(1-p)}.
$$
As $\e^x$ is an increasing function and the exponent carries a~negative sign, it suffices to maximize $\ln p \cdot \ln (1-p)$
for $p\in(0,1)$. By elementary calculus, the maximum is attained for $p = 1/2$. This
leads to false positive probability $(1/2)^k = 2^{-k}$. If we want to push this under~$\varepsilon$,
we set $k = \lceil\log(1/\varepsilon)\rceil$. Since $p = 1/2$ means $m \approx n/\ln 2$, we get
$M = km \approx kn / \ln 2 \approx n \cdot \log(1/\varepsilon) \cdot (1/\ln 2) \doteq
n \cdot \log(1/\varepsilon) \cdot 1.44$.
This improves the constant~2 from the previous construction to approximately 1.44
(TODO).
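
The choice of~$p$ and the resulting constant can be checked numerically. The script below is a~sanity check under the same perfectly-random assumption, not a~proof: it searches a~coarse grid for the~$p$ at which $\ln p \cdot \ln(1-p)$ is largest (which makes the false positive probability smallest) and computes the memory for band size $m = \lceil n/\ln 2\rceil$. The function names are hypothetical.

```python
import math


def product(p):
    """ln p * ln(1-p), the quantity multiplied by -M/n in the exponent."""
    return math.log(p) * math.log(1 - p)


# Coarse grid search over p = 0.001, 0.002, ..., 0.999.
best_p = max((i / 1000 for i in range(1, 1000)), key=product)


def optimized_params(n, k):
    """Band size m ~ n / ln 2 (so that p ~ 1/2) and total memory M = k * m."""
    m = math.ceil(n / math.log(2))
    return m, k * m


m, M = optimized_params(10**6, 7)
print(best_p)             # grid point where the product is largest
print(m, M)               # band size and total number of bits
print(M / (7 * 10**6))    # bits per item per band
```

For $n=10^6$ and $k=7$ this gives roughly 10.1\thinspace Mb, compared with 14\thinspace Mb for the construction with $m=2n$.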
TODO: Lower bound.
\subsection{Merged filters}
TODO
\subsection{Counting filters}
TODO
\subsection{Representing functions: the Bloomier filters}
TODO
\endchapter