Bloom filters: Optimization cont'd

Commit 2682e01d authored 6 years ago by Martin Mareš
--- a/06-hash/hash.tex
+++ b/06-hash/hash.tex
@@ -813,7 +813,7 @@ Let $\varepsilon > 0$ be the desired error probability
 and $n$~the maximum number of items in the set.
 The $k$-band Bloom filter with $m=2n$ and $k = \lceil \log (1/\varepsilon)\rceil$
 gives false positives with probability at most~$\varepsilon$.
-It requires $\O(m\log(1/\varepsilon))$ bits of memory and both
+It requires $2n\lceil\log (1/\varepsilon)\rceil$ bits of memory and both
 \alg{Insert} and \alg{Lookup} run in time $\O(k)$.
 }

@@ -825,10 +825,11 @@ consumption reaches only 20\thinspace Mb.
 \subsection{Optimizing parameters}

 The multi-band filter works well, but it turns out that we can fine-tune its parameters
-to obtain even better results (although only by a~constant factor). We can view it as
-an~optimization problem: given a~memory budget of~$M$ bits, set the parameters $m$ and~$k$
-such that the filter fits in memory ($mk \le M$) and the error probability is minimized.
-We will assume that all hash functions are perfectly random.
+to improve memory consumption by a~constant factor. We can view it as
+an~optimization problem: given a~memory budget of~$M$ bits, set the parameters
+$m$ and~$k$ such that the filter fits in memory ($mk \le M$) and the error
+probability is minimized. We will assume that all hash functions are perfectly
+random.

 Let us focus on a~single band first. If we select its size~$m$, we can easily
 calculate the probability that a~given bit is zero. We have $n$~items, each of them hashed
@@ -840,26 +841,29 @@ is the probability of false positives. We will find~$p$ such that this probabili
 minimized.

 If we set~$p$, it follows that $m \approx -n / \ln p$. Since all bands must fit in $M$~bits
-of memory, we must have $k = \lfloor M/m\rfloor \approx -M/n \cdot \ln p$ bands. False
+of memory, we want to use $k = \lfloor M/m\rfloor \approx -M/n \cdot \ln p$ bands. False
 positives occur if we find~1 in all bands, which has probability
 $$
 	(1-p)^k =
 	\e^{k\ln(1-p)} \approx
 	\e^{-M/n \cdot \ln p \cdot \ln(1-p)}.
 $$
-As $\e^x$ is an increasing function, it suffices to minimize $\ln p \cdot \ln (1-p)$
-for $p\in(0,1)$. By elementary calculus, the minimum is attained for $p = 1/2$. This
+As $\e^{-x}$ is a~decreasing function, it suffices to maximize $\ln p \cdot \ln (1-p)$
+for $p\in(0,1)$. By elementary calculus, the maximum is attained for $p = 1/2$. This
 leads to false positive probability $(1/2)^k = 2^{-k}$. If we want to push this under~$\varepsilon$,
-we want to set $k = \lceil\log(1/\varepsilon)\rceil$,
+we set $k = \lceil\log(1/\varepsilon)\rceil$,
 so $M = kn / \ln 2 \approx n \cdot \log(1/\varepsilon) \cdot (1/\ln 2) \doteq
 n \cdot \log(1/\varepsilon) \cdot 1.44$.
+% TODO: Plot ln(p)*ln(1-p)

-This improves the constant~2 from the previous construction to approximately 1.44
-(TODO).
+This improves the constant~2 of the previous theorem to approximately 1.44.

-TODO: Lower bound.
+\note{It is known that any approximate membership data structure with false positive
+probability~$\varepsilon$ and no false negatives must use at least $n\log(1/\varepsilon)$
+bits of memory. The optimized Bloom filter is therefore within a~factor of 1.44 of the
+optimum.}

-\subsection{Merged filters}
+\subsection{Single-table filters}

 TODO
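
The parameter optimization in the hunks above can be checked numerically. The following Python sketch is not part of the commit; the function names are made up for illustration. It assumes perfectly random hash functions, as the text does, and takes the probability that a given bit of a band stays zero to be $(1-1/m)^n$. It computes $k$ and $m$ for the choice $p = 1/2$ and confirms on a grid that $\ln p \cdot \ln(1-p)$ peaks there.

```python
import math


def optimized_bloom_parameters(n, eps):
    """Return (k, m, M) for the optimized multi-band filter (illustrative helper).

    k = ceil(log2(1/eps)) bands, each of m ~ n / ln 2 bits (i.e. p = 1/2),
    for a total of M = k * m bits.
    """
    k = math.ceil(math.log2(1 / eps))
    m = math.ceil(n / math.log(2))  # from m ~ -n / ln p with p = 1/2
    return k, m, k * m


def zero_bit_probability(n, m):
    """Probability that a fixed bit of one band is still zero after n inserts."""
    return (1 - 1 / m) ** n


def false_positive_probability(n, m, k):
    """Probability of finding a 1 in all k bands for an item not in the set."""
    return (1 - zero_bit_probability(n, m)) ** k


def objective(p):
    """The quantity ln p * ln(1 - p) that the text maximizes over p in (0, 1)."""
    return math.log(p) * math.log(1 - p)


if __name__ == "__main__":
    n, eps = 10**6, 0.01
    k, m, M = optimized_bloom_parameters(n, eps)
    print("bands k =", k)
    print("bits per band m =", m)
    print("total bits per item M/n =", M / n)
    print("false positive probability =", false_positive_probability(n, m, k))

    # Coarse grid search over p; the maximum should sit at p = 1/2.
    best_p = max((i / 1000 for i in range(1, 1000)), key=objective)
    print("argmax of ln p * ln(1-p) on the grid:", best_p)
```

Because $k$ is rounded up to an integer, the bits per item for a particular $\varepsilon$ can exceed $1.44 \cdot \log(1/\varepsilon)$; the constant 1.44 describes the approximation $M \approx n \cdot \log(1/\varepsilon) / \ln 2$.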