Commit 5d2b3d02 authored by Parth Mittal

rewrote misra/gries
the occurrences of $j$ in $\alpha[1 \ldots m]$. Then the majority problem
is to find (if it exists) a $j$ such that $f_j > m / 2$.

We consider the more general frequent elements problem, where we want to find
$F_k = \{ j \mid f_j > m / k \}$. Suppose that we knew some small set
$C$ which contains $F_k$. Then, with a pass over the input, we can count the
occurrences of each element of $C$, and hence find $F_k$ in
$\O(\vert C \vert \log m)$ space.
\subsection{The Misra/Gries Algorithm}

We will now see a deterministic one-pass algorithm that estimates the frequency
of each element in a stream of integers. We shall see that it also provides
us with a small set $C$ containing $F_k$, and hence lets us solve the frequent
elements problem efficiently.
TODO: Typeset the algorithm better.

\proc{FrequencyEstimate}$(\alpha, k)$
\algin the data stream $\alpha$, the target for the estimator $k$
\:\em{Init}: $A \= \emptyset$. (an empty map)
\:\em{Process}($x$):
\: If $x \in$ keys($A$), $A[x] \= A[x] + 1$.
\: Else if $\vert$keys($A$)$\vert < k - 1$, $A[x] \= 1$.
\: Else $\forall a \in$~keys($A$): $A[a] \= A[a] - 1$, and
delete $a$ from $A$ if $A[a] = 0$.
\:\em{Output}: $\hat{f}_a = A[a]$ if $a \in$~keys($A$), and $\hat{f}_a = 0$ otherwise.
\endalgo
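The pseudocode above can be sketched in Python (a minimal illustration of the
technique, not part of the original notes; the function name is ours):

```python
def misra_gries(stream, k):
    """One-pass Misra/Gries estimator keeping at most k - 1 counters.

    Returns the map A; the estimate is f_hat[a] = A.get(a, 0).
    """
    A = {}
    for x in stream:
        if x in A:
            A[x] += 1                 # x already tracked: bump its counter
        elif len(A) < k - 1:
            A[x] = 1                  # room for a new key
        else:
            # No room: decrement every counter, dropping keys that hit 0.
            for a in list(A):
                A[a] -= 1
                if A[a] == 0:
                    del A[a]
    return A
```

For example, `misra_gries([1, 1, 1, 2, 3, 1, 1], 2)` keeps a single counter and
ends with `{1: 3}`, an underestimate of $f_1 = 5$ within the $m / k = 3.5$
slack promised by the Lemma below.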
Let us show that $\hat{f}_a$ is a good estimate for the frequency $f_a$.

\lemma{
$f_a - m / k \leq \hat{f}_a \leq f_a$
}

\proof
We see immediately that $\hat{f}_a \leq f_a$, since $A[a]$ is only incremented
when we see $a$ in the stream.

To see the other inequality, suppose that we have a counter for each
$a \in [n]$ (instead of just $k - 1$ keys at a time). Whenever at least
$k$ counters are non-zero, we decrease all of them by $1$; this gives exactly
the same estimate as the algorithm above.

Now consider the potential function $\Phi = \sum_{a \in [n]} A[a]$. Note that
$\Phi$ increases exactly $m$ times (since $\alpha$ contains $m$ elements), and
decreases by at least $k$ every time any $A[x]$ decreases. Since $\Phi = 0$
initially and $\Phi \geq 0$ always, any fixed $A[x]$ decreases at most
$m / k$ times.
\qed
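For concreteness (an illustrative example, not in the original notes): take
$\alpha = \langle 1, 1, 1, 2, 3, 1, 1 \rangle$, so $m = 7$, and run the
algorithm with $k = 2$, so $A$ holds at most one key. The first three $1$s
give $A[1] = 3$; the elements $2$ and $3$ each trigger a decrement, leaving
$A[1] = 1$; the final two $1$s raise it back to $A[1] = 3$. Indeed
$f_1 - m / k = 5 - 3.5 \leq \hat{f}_1 = 3 \leq f_1 = 5$.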
\theorem{
There exists a deterministic 2-pass algorithm that finds $F_k$ in
$\O(k(\log n + \log m))$ space.
}

\proof
In the first pass, we obtain the frequency estimate $\hat{f}$ by the
Misra/Gries algorithm, and set $C = \{ a \mid \hat{f}_a > 0 \}$. For every
$a \in F_k$ we have $f_a > m / k$, and hence $\hat{f}_a > 0$ by the previous
Lemma; thus $F_k \subseteq C$. In the second pass, we count $f_c$ exactly for
each $c \in C$, and hence know $F_k$ at the end.

To see the bound on the space used, note that
$\vert C \vert = \vert$keys($A$)$\vert \leq k - 1$, and a key-value pair can
be stored in $\O(\log n + \log m)$ bits.
\qed
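The two-pass scheme can be sketched as follows (a self-contained Python
illustration under the assumption that the stream fits in a re-readable
sequence; the function names are ours):

```python
from collections import Counter

def frequent_elements(stream, k):
    """Two-pass computation of F_k = { j : f_j > m / k }.

    Pass 1 runs Misra/Gries to get at most k - 1 candidate keys;
    pass 2 counts each candidate exactly.
    """
    # Pass 1: Misra/Gries with at most k - 1 counters.
    A = {}
    m = 0
    for x in stream:
        m += 1
        if x in A:
            A[x] += 1
        elif len(A) < k - 1:
            A[x] = 1
        else:
            for a in list(A):
                A[a] -= 1
                if A[a] == 0:
                    del A[a]
    # Pass 2: exact counts for the candidates only, then threshold.
    counts = Counter(x for x in stream if x in A)
    return {a for a, f in counts.items() if f > m / k}
```

Here the second pass stores one exact counter per candidate, matching the
$\O(k(\log n + \log m))$ space bound of the Theorem.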
\endchapter