wrote count-min and the AMS estimator for distinct

defbcae6 · Parth Mittal · 9744cd1b · defbcae6
Commit defbcae6 authored 4 years ago by Parth Mittal
--- a/streaming/streaming.tex
+++ b/streaming/streaming.tex
@@ -43,7 +43,7 @@ us with a small set $C$ containing $F_k$, and hence lets us solve the frequent
 elements problem efficiently.
 \algo{FrequencyEstimate} \algalias{Misra/Gries Algorithm}
-\algin the data stream $\alpha$, the target for the estimator $k$
+\algin the data stream $\alpha$, the target for the estimator $k$.
 \:\em{Init}: $A \= \emptyset$. \cmt{an empty map}
 \:\em{Process}($x$): 
 \::If $x \in$ keys($A$), $A[x] \= A[x] + 1$.
@@ -93,12 +93,138 @@ $\vert C \vert = \vert$keys($A$)$\vert \leq k - 1$, and a key-value pair can
 be stored in $\O(\log n + \log m)$ bits.
 \qed
-\subsection{The Count-Min sketch}
+\subsection{The Count-Min Sketch}
+We will now look at a randomized streaming algorithm that solves the 
+frequency estimation problem. While this algorithm can fail with some
+probability, it has the advantage that the output on two different streams
+can be easily combined.
+\algo{FrequencyEstimate} \algalias{Count-Min Sketch}
+\algin the data stream $\alpha$, the accuracy $\varepsilon$,
+ the error parameter $\delta$.
+\:\em{Init}:
+ $C[1\ldots t][1\ldots k] \= 0$, where $k \= \lceil 2 / \varepsilon \rceil$
+ and $t \= \lceil \log_2(1 / \delta) \rceil$.
+\:: Choose $t$ independent hash functions $h_1, \ldots, h_t : [n] \to [k]$, each
+ from a 2-independent family.
+\:\em{Process}($x$): 
+\::For $i \in [t]$: $C[i][h_i(x)] \= C[i][h_i(x)] + 1$.
+\algout Report $\hat{f}_a = \min_{i \in [t]} C[i][h_i(a)]$.
+\endalgo
-We will now look at a randomized streaming algorithm that performs the same task 
+Note that the algorithm needs $\O(tk \log m)$ bits to store the table $C$, and
+$\O(t \log n)$ bits to store the hash functions $h_1, \ldots, h_t$, and hence
+uses $\O(1/\varepsilon \cdot \log (1 / \delta) \cdot \log m
+ + \log (1 / \delta)\cdot  \log n)$ bits. It remains to show that it computes
+a good estimate.
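+As a rough illustration (the parameter values here are assumptions chosen for
+the example, not taken from the text), take $\varepsilon = 0.01$ and
+$\delta = 0.01$, with logarithms base 2. Then
+$$ k = \lceil 2 / 0.01 \rceil = 200, \qquad
+ t = \lceil \log 100 \rceil = \lceil 6.64\ldots \rceil = 7 $$
+so the table $C$ has $tk = 1400$ counters, regardless of the universe size $n$.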
-\endchapter
+\lemma{
+    For every $a \in [n]$, $f_a \leq \hat{f}_a \leq f_a + \varepsilon m$ with
+    probability at least $1 - \delta$.
+}
+\proof
+Clearly $\hat{f}_a \geq f_a$ for all $a \in [n]$; we will show that
+$\hat{f}_a \leq f_a + \varepsilon m$ with probability at least $1 - \delta$.
+For a fixed element $a$, define the random variable
+$$X_i := C[i][h_i(a)] - f_a$$
+For $j \in [n] \setminus \{ a \}$, define the
+indicator variable $Y_{i, j} := [ h_i(j) = h_i(a) ]$. Then we can see that
+$$X_i = \sum_{j \neq a} f_j\cdot Y_{i, j}$$
+Note that $\E[Y_{i, j}] = 1/k$ since each $h_i$ is from a 2-independent family,
+and hence by linearity of expectation:
+$$\E[X_i] = {\vert\vert f \vert\vert_1 - f_a \over k} =
+ {\vert\vert f_{-a} \vert\vert_1 \over k}$$
+Since $\vert\vert f_{-a} \vert\vert_1 \leq \vert\vert f \vert\vert_1 = m$,
+applying Markov's inequality gives a bound on the error of a single counter:
+$$ \Pr[X_i > \varepsilon \cdot m ] \leq
+ \Pr[ X_i > \varepsilon \cdot \vert\vert f_{-a} \vert\vert_1 ] \leq
+ {1 \over k\varepsilon} \leq 1/2$$
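+Finally, since the $t$ hash functions are chosen independently, the probability
+that all $t$ counters exceed this error is at most:
+$$ \Pr\left[\bigcap_i \{ X_i > \varepsilon \cdot m \} \right] \leq 1/2^t \leq \delta $$
+Hence with probability at least $1 - \delta$ some counter satisfies
+$X_i \leq \varepsilon m$, and then $\hat{f}_a \leq f_a + \varepsilon m$.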
+\qed
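+The mergeability mentioned at the start of this subsection can be made
+concrete. As a sketch, assume two streams $\alpha$ and $\beta$ are processed
+with the same hash functions $h_1, \ldots, h_t$, and write $C^{\alpha}$,
+$C^{\beta}$ for the resulting tables (this notation is introduced here only
+for illustration). Since each update only increments counters, the table for
+the concatenated stream $\alpha \circ \beta$ is the entrywise sum:
+$$ C^{\alpha \circ \beta}[i][j] = C^{\alpha}[i][j] + C^{\beta}[i][j]
+ \quad\hbox{for all } i \in [t],\ j \in [k] $$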
+\section{Counting Distinct Elements}
+We continue working with a stream $\alpha[1 \ldots m]$ of integers from $[n]$,
+and define $f_a$ (the frequency of $a$) as before. Let
+$d = \vert \{ j : f_j > 0 \} \vert$. Then the distinct elements problem is
+to estimate $d$.
+\subsection{The AMS Algorithm}
+Define ${\tt tz}(x) := \max\{ i \mid 2^i $~divides~$ x \}$
+ (i.e. the number of trailing zeroes in the base-2 representation of $x$).
+\algo{DistinctElements} \algalias{AMS}
+\algin the data stream $\alpha$.
+\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent
+ family.
+\:: $z \= 0$.
+\:\em{Process}($x$): 
+\::If ${\tt tz}(h(x)) > z$: $z \= {\tt tz}(h(x))$.
+\algout $\hat{d} \= 2^{z + 1/2}$
+\endalgo
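+For instance, ${\tt tz}(12) = 2$ since $12 = 1100_2$. As a small worked run
+(the hash values are hypothetical, chosen only for illustration), suppose the
+stream contains three distinct elements with hashes $12 = 1100_2$,
+$5 = 101_2$ and $8 = 1000_2$. Their trailing-zero counts are $2$, $0$ and $3$,
+so the algorithm ends with $z = 3$ and reports
+$$ \hat{d} = 2^{3 + 1/2} = 8\sqrt{2} \approx 11.3 $$
+which overshoots the true $d = 3$ by more than a factor of $3$, the kind of
+bad event the analysis below bounds.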
+\lemma{
+    The AMS algorithm is a $(3, \delta)$-estimator for some constant
+    $\delta$.
+}
+\proof
+For $j \in [n]$, $r \geq 0$, let $X_{r, j} := [ {\tt tz}(h(j)) \geq r ]$, the
+indicator that is true if $h(j)$ has at least $r$ trailing $0$s.
+Now define $$ Y_r = \sum_{j : f_j > 0} X_{r, j} $$
+How is our estimate related to $Y_r$? If the algorithm outputs
+$\hat{d} \geq 2^{a + 1/2}$, then we know that $Y_a > 0$. Similarly, if the
+output is smaller than $2^{a + 1/2}$, then we know that $Y_a = 0$. We will now
+bound the probabilities of these events.
+For any $j \in [n]$, $h(j)$ is uniformly distributed over $[n]$ (since $h$
+is $2$-independent). Assuming for simplicity that $n$ is a power of two, this
+gives $\E[X_{r, j}] = 1 / 2^r$ for $0 \leq r \leq \log n$. By linearity of
+expectation, $\E[Y_{r}] = d / 2^r$.
+We will also use the variance of these variables -- note that
+$${\rm Var}[X_{r, j}] \leq \E[X_{r, j}^2] = \E[X_{r, j}] = 1/2^r$$
+And because $h$ is $2$-independent, the variables $X_{r, j}$ and $X_{r, j'}$
+are independent for $j \neq j'$, and hence:
+$${{\rm Var}}[Y_{r}] = \sum_{j : f_j > 0} {\rm Var}[X_{r, j}] \leq d / 2^r $$
+Now, let $a$ be the smallest integer such that $2^{a + 1/2} \geq 3d$. Then we
+have:
+$$ \Pr[\hat{d} \geq 3d] = \Pr[Y_a > 0] = \Pr[Y_a \geq 1] $$
+Using Markov's inequality we get:
+$$ \Pr[\hat{d} \geq 3d] \leq \E[Y_a] = {d \over 2^a} \leq {\sqrt{2} \over 3} $$
+For the other side, let $b$ be the largest integer such that
+$2^{b + 1/2} \leq d/3$. Then we have:
+$$ \Pr[\hat{d} \leq d / 3] = \Pr[ Y_{b + 1} = 0] \leq
+ \Pr[ \vert Y_{b + 1} - \E[Y_{b + 1}] \vert \geq d / 2^{b + 1} ]$$
+Using Chebyshev's inequality, we get:
+$$ \Pr[\hat{d} \leq d / 3] \leq {{\rm Var}[Y_{b + 1}] \over (d / 2^{b + 1})^2} \leq
+ {2^{b + 1} \over d} \leq {\sqrt{2} \over 3}$$
+\qed
+The previous algorithm is not particularly satisfying: by our analysis it
+can fail with probability up to $2\sqrt{2}/3 \approx 94\%$ (taking a union
+bound over the two bad events). However, we can improve the success
+probability easily: we run $t$ independent estimators simultaneously, and
+output the median of their outputs.
+By a standard use of Chernoff bounds one can show that the probability that
+the median is more than $3d$ is at most $2^{-\Theta(t)}$ (and similarly also
+the probability that it is less than $d / 3$).
+Hence it is enough to run $\O(\log (1/ \delta))$ copies of the AMS estimator
+to get a $(3, \delta)$-estimator for any $\delta > 0$. Finally, we note that
+the space used by a single estimator is $\O(\log n)$ bits, since we can store
+$h$ in $\O(\log n)$ bits and $z$ in $\O(\log \log n)$ bits, and hence a
+$(3, \delta)$-estimator uses $\O(\log (1/\delta) \cdot \log n)$ bits.
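+To get a feel for the constant hidden in $2^{-\Theta(t)}$, here is one way to
+carry out the bound, using Hoeffding's inequality in place of a general
+Chernoff bound (a sketch under that assumption). Let $S$ be the number of the
+$t$ estimators whose output is at least $3d$; each does so independently
+with probability at most $\sqrt{2}/3 \approx 0.471$. The median can be at
+least $3d$ only if $S \geq t/2$, and so
+$$ \Pr[S \geq t/2] \leq
+ \exp\left(-2t \left({1 \over 2} - {\sqrt{2} \over 3}\right)^2\right)
+ \approx e^{-0.0016\, t} $$
+The same calculation applies to the event that the median is at most $d/3$.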
+\endchapter