Commit 2682e01d authored by Martin Mareš's avatar Martin Mareš

Bloom filters: Optimization cont'd

parent 07035907
......@@ -813,7 +813,7 @@ Let $\varepsilon > 0$ be the desired error probability
and $n$~the maximum number of items in the set.
The $k$-band Bloom filter with $m=2n$ and $k = \lceil \log (1/\varepsilon)\rceil$
gives false positives with probability at most~$\varepsilon$.
It requires $\O(m\log(1/\varepsilon))$ bits of memory and both
It requires $2n\lceil\log (1/\varepsilon)\rceil$ bits of memory and both
\alg{Insert} and \alg{Lookup} run in time $\O(k)$.
}
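As a rough illustration of the statement above, here is a minimal Python sketch of a $k$-band filter with $m=2n$ and $k = \lceil \log (1/\varepsilon)\rceil$. The class name \texttt{BandBloomFilter} and the use of salted BLAKE2b as a stand-in for the per-band random hash functions are assumptions made for this example, not part of the notes. Each bit is stored in a whole byte for readability, so the sketch does not achieve the stated memory bound.

```python
import hashlib
import math


class BandBloomFilter:
    """k-band Bloom filter: k arrays of m bits, one hash function per band."""

    def __init__(self, n, eps):
        self.m = 2 * n                                  # band size m = 2n
        self.k = max(1, math.ceil(math.log2(1 / eps)))  # k = ceil(log(1/eps)) bands
        # One byte per bit, for clarity only (a real filter would pack bits).
        self.bands = [bytearray(self.m) for _ in range(self.k)]

    def _index(self, band, item):
        # Assumption: BLAKE2b salted with the band number plays the role
        # of the band's random hash function.
        salt = band.to_bytes(8, "little")
        digest = hashlib.blake2b(repr(item).encode(), salt=salt).digest()
        return int.from_bytes(digest[:8], "little") % self.m

    def insert(self, item):
        # Set one bit in every band: O(k) time.
        for i in range(self.k):
            self.bands[i][self._index(i, item)] = 1

    def lookup(self, item):
        # Report "present" only if the item's bit is set in all k bands: O(k) time.
        return all(self.bands[i][self._index(i, item)] for i in range(self.k))
```

Inserted items are always found, while an item that was never inserted is reported as present only if its bit happens to be set in all $k$ bands.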
......@@ -825,10 +825,11 @@ consumption reaches only 20\thinspace Mb.
\subsection{Optimizing parameters}
The multi-band filter works well, but it turns out that we can fine-tune its parameters
to obtain even better results (although only by a~constant factor). We can view it as
an~optimization problem: given a~memory budget of~$M$ bits, set the parameters $m$ and~$k$
such that the filter fits in memory ($mk \le M$) and the error probability is minimized.
We will assume that all hash functions are perfectly random.
to improve memory consumption by a~constant factor. We can view it as
an~optimization problem: given a~memory budget of~$M$ bits, set the parameters
$m$ and~$k$ such that the filter fits in memory ($mk \le M$) and the error
probability is minimized. We will assume that all hash functions are perfectly
random.
Let us focus on a~single band first. If we select its size~$m$, we can easily
calculate the probability that a~given bit is zero. We have $n$~items, each of them hashed
......@@ -840,26 +841,29 @@ is the probability of false positives. We will find~$p$ such that this probabili
minimized.
If we set~$p$, it follows that $m \approx -n / \ln p$. Since all bands must fit in $M$~bits
of memory, we must have $k = \lfloor M/m\rfloor \approx -M/n \cdot \ln p$ bands. False
of memory, we want to use $k = \lfloor M/m\rfloor \approx -M/n \cdot \ln p$ bands. False
positives occur if we find~1 in all bands, which has probability
$$
(1-p)^k =
\e^{k\ln(1-p)} \approx
\e^{-M/n \cdot \ln p \cdot \ln(1-p)}.
$$
As $\e^x$ is an increasing function, it suffices to minimize $\ln p \cdot \ln (1-p)$
for $p\in(0,1)$. By elementary calculus, the minimum is attained for $p = 1/2$. This
As $\e^{-x}$ is a~decreasing function, it suffices to maximize $\ln p \cdot \ln (1-p)$
for $p\in(0,1)$. By elementary calculus, the maximum is attained for $p = 1/2$. This
leads to false positive probability $(1/2)^k = 2^{-k}$. If we want to push this under~$\varepsilon$,
we want to set $k = \lceil\log(1/\varepsilon)\rceil$,
we set $k = \lceil\log(1/\varepsilon)\rceil$,
so $M = kn / \ln 2 \approx n \cdot \log(1/\varepsilon) \cdot (1/\ln 2) \doteq
n \cdot \log(1/\varepsilon) \cdot 1.44$.
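The choice $p = 1/2$ and the resulting parameters can be checked numerically. The following Python sketch scans a grid of values of~$p$ for the maximum of $\ln p \cdot \ln(1-p)$ and then computes $m$, $k$ and $M = mk$ for given $n$ and~$\varepsilon$. The function names and the rounding of~$m$ up to an integer are assumptions made for this example.

```python
import math


def exponent_factor(p):
    # The quantity ln p * ln(1-p) that the derivation maximizes.
    return math.log(p) * math.log(1 - p)


def optimized_parameters(n, eps):
    # With p = 1/2 we get m = n / ln 2 (rounded up here) and k = ceil(log2(1/eps)).
    k = math.ceil(math.log2(1 / eps))
    m = math.ceil(n / math.log(2))
    return m, k, m * k


# Scan p = 0.001, 0.002, ..., 0.999 for the maximum of ln p * ln(1-p).
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=exponent_factor)
print(best_p)                            # 0.5

# n = 1000 items, eps = 1 %: m = 1443, k = 7, M = 10101 bits,
# compared to 2n * k = 14000 bits with m = 2n.
print(optimized_parameters(1000, 0.01))  # (1443, 7, 10101)
```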
% TODO: Plot ln(p)*ln(1-p)
This improves the constant~2 from the previous construction to approximately 1.44
(TODO).
This improves the constant in the previous theorem from~2 to approximately 1.44.
TODO: Lower bound.
\note{It is known that any approximate membership data structure with false positive
probability~$\varepsilon$ and no false negatives must use at least $n\log(1/\varepsilon)$
bits of memory. The optimized Bloom filter is therefore within a~factor of 1.44 of the
optimum.}
\subsection{Merged filters}
\subsection{Single-table filters}
TODO