From defbcae6a4f9412b4170ce9c68c763aee43a1df6 Mon Sep 17 00:00:00 2001
From: Parth Mittal <parth15069@iiitd.ac.in>
Date: Fri, 30 Apr 2021 23:04:12 +0530
Subject: [PATCH] wrote count-min and the AMS estimator for distinct

---
 streaming/streaming.tex | 134 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 130 insertions(+), 4 deletions(-)

diff --git a/streaming/streaming.tex b/streaming/streaming.tex
index 5cc4b71..687d848 100644
--- a/streaming/streaming.tex
+++ b/streaming/streaming.tex
@@ -43,7 +43,7 @@ us with a small set $C$ containing $F_k$, and hence lets us solve the frequent
 elements problem efficiently.
 
 \algo{FrequencyEstimate} \algalias{Misra/Gries Algorithm}
-\algin the data stream $\alpha$, the target for the estimator $k$
+\algin the data stream $\alpha$, the target for the estimator $k$.
 \:\em{Init}: $A \= \emptyset$. \cmt{an empty map}
 \:\em{Process}($x$):
 \::If $x \in$ keys($A$), $A[x] \= A[x] + 1$.
@@ -93,12 +93,138 @@ $\vert C \vert = \vert$keys($A$)$\vert \leq k - 1$, and a key-value pair
 can be stored in $\O(\log n + \log m)$ bits.
 \qed
 
-\subsection{The Count-Min sketch}
+\subsection{The Count-Min Sketch}
+
+We will now look at a randomized streaming algorithm that solves the
+frequency estimation problem. While this algorithm can fail with some
+probability, it has the advantage that outputs computed on two different
+streams can be easily combined.
+
+\algo{FrequencyEstimate} \algalias{Count-Min Sketch}
+\algin the data stream $\alpha$, the accuracy $\varepsilon$,
+  the error parameter $\delta$.
+\:\em{Init}:
+  $C[1\ldots t][1\ldots k] \= 0$, where $k \= \lceil 2 / \varepsilon \rceil$
+  and $t \= \lceil \log(1 / \delta) \rceil$.
+\:: Choose $t$ independent hash functions $h_1, \ldots, h_t : [n] \to [k]$,
+  each from a 2-independent family.
+\:\em{Process}($x$):
+\::For $i \in [t]$: $C[i][h_i(x)] \= C[i][h_i(x)] + 1$.
+\algout Report $\hat{f}_a = \min_{i \in [t]} C[i][h_i(a)]$.
+\endalgo
 
-We will now look at a randomized streaming algorithm that performs the same task
+Note that the algorithm needs $\O(tk \log m)$ bits to store the table $C$,
+and $\O(t \log n)$ bits to store the hash functions $h_1, \ldots, h_t$, and
+hence uses $\O(1/\varepsilon \cdot \log (1 / \delta) \cdot \log m +
+\log (1 / \delta) \cdot \log n)$ bits in total. It remains to show that it
+computes a good estimate.
 
-\endchapter
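+
+To make the bookkeeping concrete, here is a minimal Python sketch of the
+data structure (the class name and the Carter--Wegman construction
+$h(x) = ((ax + b) \bmod p) \bmod k$ of the 2-independent family are our own
+illustrative choices; any 2-independent family works):
+
+\begin{verbatim}
+import math, random
+
+class CountMinSketch:
+    def __init__(self, eps, delta):
+        self.k = math.ceil(2 / eps)               # counters per row
+        self.t = math.ceil(math.log2(1 / delta))  # independent rows
+        self.C = [[0] * self.k for _ in range(self.t)]
+        self.p = (1 << 61) - 1                    # prime, larger than the universe
+        # one (a, b) pair per row defines h_i(x) = ((a*x + b) mod p) mod k
+        self.ab = [(random.randrange(1, self.p), random.randrange(self.p))
+                   for _ in range(self.t)]
+
+    def _h(self, i, x):
+        a, b = self.ab[i]
+        return ((a * x + b) % self.p) % self.k
+
+    def process(self, x):
+        for i in range(self.t):
+            self.C[i][self._h(i, x)] += 1
+
+    def estimate(self, a):
+        return min(self.C[i][self._h(i, a)] for i in range(self.t))
+\end{verbatim}
+
+Note also that two sketches built with the same hash functions can be merged
+by adding their tables $C$ entrywise, which is what makes the outputs on two
+different streams easy to combine.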
+
+\lemma{
+  $f_a \leq \hat{f}_a \leq f_a + \varepsilon m$ with probability at least
+  $1 - \delta$.
+}
+
+\proof
+Clearly $\hat{f}_a \geq f_a$ for all $a \in [n]$, since every counter
+$C[i][h_i(a)]$ counts all occurrences of $a$ (and possibly of other
+elements). We will show that the event $\hat{f}_a > f_a + \varepsilon m$
+occurs with probability at most $\delta$.
+
+For a fixed element $a$, define the random variable
+$$X_i := C[i][h_i(a)] - f_a,$$
+i.e. the overcount of the $i$-th counter. For $j \in [n] \setminus \{ a \}$,
+define the indicator variable $Y_{i, j} := [ h_i(j) = h_i(a) ]$. Then we can
+see that
+$$X_i = \sum_{j \neq a} f_j \cdot Y_{i, j}.$$
+
+Note that $\E[Y_{i, j}] = 1/k$ since each $h_i$ is from a 2-independent
+family, and hence by linearity of expectation:
+$$\E[X_i] = {\vert\vert f \vert\vert_1 - f_a \over k} =
+  {\vert\vert f_{-a} \vert\vert_1 \over k},$$
+where $f_{-a}$ denotes the frequency vector with the $a$-th coordinate
+zeroed out. Since
+$\vert\vert f_{-a} \vert\vert_1 \leq \vert\vert f \vert\vert_1 = m$,
+Markov's inequality bounds the error of a single counter:
+$$ \Pr[X_i > \varepsilon m ] \leq
+   \Pr[ X_i > \varepsilon \cdot \vert\vert f_{-a} \vert\vert_1 ] \leq
+   {1 \over k\varepsilon} \leq 1/2.$$
+
+Finally, since $\hat{f}_a = f_a + \min_i X_i$, the estimate fails only if
+all $t$ independent counters are wrong at once:
+$$ \Pr[\hat{f}_a > f_a + \varepsilon m] =
+   \Pr\left[\bigcap_i \left\{ X_i > \varepsilon m \right\} \right]
+   \leq 1/2^t \leq \delta. $$
+\qed
+
+\section{Counting Distinct Elements}
+
+We continue working with a stream $\alpha[1 \ldots m]$ of integers from
+$[n]$, and define $f_a$ (the frequency of $a$) as before. Let
+$d = \vert \{ j : f_j > 0 \} \vert$ be the number of distinct elements in
+the stream. The distinct elements problem is to estimate $d$.
+
+\subsection{The AMS Algorithm}
+
+Define ${\tt tz}(x) := \max\{ i \mid 2^i $~divides~$ x \}$
+(i.e. the number of trailing zeroes in the base-2 representation of $x$).
+
+\algo{DistinctElements} \algalias{AMS}
+\algin the data stream $\alpha$.
+\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a
+  2-independent family.
+\:: $z \= 0$.
+\:\em{Process}($x$):
+\::If ${\tt tz}(h(x)) > z$: $z \= {\tt tz}(h(x))$.
+\algout $\hat{d} \= 2^{z + 1/2}$.
+\endalgo
+
+\lemma{
+  The AMS algorithm is a $(3, \delta)$-estimator for some constant
+  $\delta < 1$.
+}
+
+\proof
+For $j \in [n]$, $r \geq 0$, let $X_{r, j} := [ {\tt tz}(h(j)) \geq r ]$,
+the indicator that is true iff $h(j)$ has at least $r$ trailing zeroes.
+Now define
+$$ Y_r = \sum_{j : f_j > 0} X_{r, j}, $$
+the number of distinct elements of the stream whose hash has at least $r$
+trailing zeroes. How is our estimate related to $Y_r$? Since $z$ is the
+maximum of ${\tt tz}(h(x))$ over the stream, the algorithm outputs
+$\hat{d} \geq 2^{a + 1/2}$ if and only if $Y_a > 0$; equivalently, the
+output is smaller than $2^{a + 1/2}$ if and only if $Y_a = 0$. We will now
+bound the probabilities of these events.
+
+For any $j \in [n]$, $h(j)$ is uniformly distributed over $[n]$ (since $h$
+is $2$-independent), so $\E[X_{r, j}] = 1 / 2^r$ (assuming, for simplicity,
+that $n$ is a power of two). By linearity of expectation,
+$\E[Y_{r}] = d / 2^r$.
+
+We will also use the variance of these variables -- note that
+$${\rm Var}[X_{r, j}] \leq \E[X_{r, j}^2] = \E[X_{r, j}] = 1/2^r.$$
+And because $h$ is $2$-independent, the variables $X_{r, j}$ and $X_{r, j'}$
+are pairwise independent for $j \neq j'$, and hence:
+$${\rm Var}[Y_{r}] = \sum_{j : f_j > 0} {\rm Var}[X_{r, j}] \leq d / 2^r. $$
+
+Now, let $a$ be the smallest integer such that $2^{a + 1/2} \geq 3d$. Then
+we have:
+$$ \Pr[\hat{d} \geq 3d] = \Pr[Y_a > 0] = \Pr[Y_a \geq 1]. $$
+Using Markov's inequality we get:
+$$ \Pr[\hat{d} \geq 3d] \leq \E[Y_a] = {d \over 2^a} \leq
+   {\sqrt{2} \over 3},$$
+where the last inequality follows from the choice of $a$.
+
+For the other side, let $b$ be the largest integer such that
+$2^{b + 1/2} \leq d/3$. The output is at most $d/3$ exactly when
+$z \leq b$, i.e. when $Y_{b + 1} = 0$. Since
+$\E[Y_{b + 1}] = d / 2^{b + 1}$, we have:
+$$ \Pr[\hat{d} \leq d / 3] = \Pr[ Y_{b + 1} = 0] \leq
+   \Pr[ \vert Y_{b + 1} - \E[Y_{b + 1}] \vert \geq d / 2^{b + 1} ].$$
+Using Chebyshev's inequality, we get:
+$$ \Pr[\hat{d} \leq d / 3] \leq
+   {{\rm Var}[Y_{b + 1}] \over (d / 2^{b + 1})^2} \leq
+   {2^{b + 1} \over d} \leq {\sqrt{2} \over 3},$$
+where the last inequality again follows from the choice of $b$.
+\qed
+
+The previous algorithm is not particularly satisfying -- by our analysis,
+the two bad events together can occur with probability up to
+$2\sqrt{2}/3 \approx 0.94$. However, we can improve the success probability
+easily: we run $t$ independent copies of the estimator simultaneously, and
+output the median of their outputs. Since each copy exceeds $3d$ with
+probability at most $\sqrt{2}/3 < 1/2$, the median exceeds $3d$ only if at
+least half of the copies do, and by a standard use of Chernoff bounds this
+happens with probability at most $2^{-\Theta(t)}$ (and similarly for the
+probability that the median is less than $d / 3$).
+
+Hence it is enough to run $\O(\log (1/ \delta))$ copies of the AMS estimator
+to get a $(3, \delta)$-estimator for any $\delta > 0$.
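+
+The following minimal Python sketch implements one copy of the estimator
+together with the median amplification (the names, the Carter--Wegman hash
+family $h(x) = (ax + b) \bmod p$, and the exact number of copies are our
+own illustrative choices; the constant in front of $\log(1/\delta)$ is not
+optimized):
+
+\begin{verbatim}
+import math, random
+
+class AMSDistinct:
+    W = 61                         # hash values lie in [0, 2**W)
+
+    def __init__(self):
+        self.p = (1 << 61) - 1     # Mersenne prime defining the hash family
+        self.a = random.randrange(1, self.p)
+        self.b = random.randrange(self.p)
+        self.z = 0                 # largest number of trailing zeroes seen
+
+    def process(self, x):
+        hx = (self.a * x + self.b) % self.p
+        # trailing zeroes of hx in binary; we treat tz(0) as W
+        t = self.W if hx == 0 else (hx & -hx).bit_length() - 1
+        self.z = max(self.z, t)
+
+    def estimate(self):
+        return 2 ** (self.z + 0.5)
+
+def distinct_estimate(stream, delta=0.05):
+    # run Theta(log(1/delta)) copies and output the median of their estimates
+    copies = [AMSDistinct() for _ in range(8 * math.ceil(math.log2(1 / delta)))]
+    for x in stream:
+        for c in copies:
+            c.process(x)
+    outputs = sorted(c.estimate() for c in copies)
+    return outputs[len(outputs) // 2]
+\end{verbatim}
+
+For example, {\tt distinct\_estimate(range(1, 10**5))} should return a value
+between $d/3$ and $3d$ for $d = 10^5 - 1$, except with probability roughly
+$\delta$.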
+
+Finally, we note that the space used by a single estimator is $\O(\log n)$
+bits: we can store $h$ in $\O(\log n)$ bits, and $z$ in $\O(\log \log n)$
+bits. Hence the resulting $(3, \delta)$-estimator uses
+$\O(\log (1/\delta) \cdot \log n)$ bits.
+
+\endchapter
-- 
GitLab