diff --git a/streaming/streaming.tex b/streaming/streaming.tex
index cf635cb3356b2132830368766b3152bdc77306dc..777eaae37c597167cbefee33866f7fda4e917983 100644
--- a/streaming/streaming.tex
+++ b/streaming/streaming.tex
@@ -38,4 +38,60 @@
 Algorithm does.
 
 \subsection{Misra/Gries Algorithm}
+TODO: Typeset the algorithm better.
+\proc{FrequencyEstimate}$(\alpha, k)$
+\algin the data stream $\alpha$, the target for the estimator $k$
+\:Init: $A \= \emptyset$
+\:For each number $j$ from the stream:
+\:If $j$ is a key in $A$, $A[j] \= A[j] + 1$.
+\:Else If $\vert A \vert < k - 1$, add the key $j$ to $A$ and set $A[j] \= 1$.
+\:Else For each key $\ell$ in $A$, reduce $A[\ell] \= A[\ell] - 1$.
+   Delete $\ell$ from $A$ if $A[\ell] = 0$.
+\:After processing the entire stream, return $A$.
+\endalgo
+
+Let us show that $A[j]$ is a good estimate for the frequency $f_j$.
+
+\lemma{
+$f_j - m / k \leq A[j] \leq f_j$
+}
+
+\proof
+Suppose that $A$ maintains a value for each key $j \in [n]$ (instead of
+just $k - 1$ of them). We can recast \alg{FrequencyEstimate} in this setting:
+We always increment $A[j]$ on seeing $j$ in the stream, but if there are
+$\geq k$ positive values $A[\ell]$ after this step, we decrease each of them
+by 1.
+In particular, this reduces the value $A[j]$ of the most recently added key
+back to $0$.
+
+Now, we see immediately that $A[j] \leq f_j$, since it is only incremented when
+we see $j$ in the stream. To see the other inequality, consider the potential
+function $\Phi = \sum_{\ell} A[\ell]$. Over the whole stream, the increments add
+exactly $m$ to $\Phi$, and every decrease step removes exactly $k$ from it. Since
+$\Phi = 0$ initially and $\Phi \geq 0$ always, there are at most $m / k$ decrease
+steps, each lowering $A[j]$ by at most $1$. Hence $A[j] \geq f_j - m / k$.
+\qed
+
+Now, for $j \in F_k$, we know that $f_j > m / k$, which implies that $A[j] > 0$.
+Hence $F_k \subseteq C = \{ j \mid A[j] > 0 \}$, and we have a $C$ of size
+at most $k - 1$ ready for the second pass over the input.
+
+\theorem{
+  There exists a deterministic 2-pass algorithm that finds $F_k$ in
+  $\O(k(\log n + \log m))$ space.
+}
+\proof
+The correctness of the algorithm follows from the discussion above; it remains
+to bound the space used.
+
+In the first pass, we only need to store $k - 1$ key-value pairs for $A$
+(for example, as an unordered list),
+where each key and each value need $\lfloor\log_2 n \rfloor + 1$ and
+$\lfloor \log_2 m \rfloor + 1$ bits, respectively.
+In the second pass, we keep one key-value pair for each element of $C$, and
+these take the same amount of space as above.
+
+\qed
+
 \endchapter
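
To sanity-check the pseudocode added in this patch, the two passes can be sketched in Python as below. This is only an illustrative sketch, not part of the patch: the function names `frequency_estimate` and `frequent_elements` are hypothetical, the stream is assumed to be an in-memory sequence so that it can be replayed for the second pass, and a plain `dict` stands in for the key-value store $A$.

```python
def frequency_estimate(stream, k):
    """First pass: keep at most k - 1 counters, as in FrequencyEstimate."""
    counters = {}  # plays the role of A in the pseudocode
    for j in stream:
        if j in counters:
            counters[j] += 1
        elif len(counters) < k - 1:
            counters[j] = 1
        else:
            # No free slot: decrement every counter, drop those hitting zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def frequent_elements(stream, k):
    """Two passes: collect candidates C, then count only the candidates."""
    stream = list(stream)  # assumption: the input can be replayed
    m = len(stream)
    candidates = frequency_estimate(stream, k)
    exact = dict.fromkeys(candidates, 0)
    for j in stream:
        if j in exact:
            exact[j] += 1
    # Keep the elements with f_j > m / k (integer form avoids float division).
    return {j for j, f in exact.items() if f * k > m}


if __name__ == "__main__":
    sample = [1, 2, 1, 3, 1, 2, 1, 4, 1]  # m = 9, so m / k = 3 for k = 3
    print(frequency_estimate(sample, 3))  # {1: 3}
    print(frequent_elements(sample, 3))   # {1}
```

On the sample stream, the element 1 has $f_1 = 5$ and the first pass returns $A[1] = 3$, which lies between $f_1 - m/k = 2$ and $f_1 = 5$, as the lemma predicts.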