datovky / ds2-notes
Commit 5d2b3d02, authored Apr 27, 2021 by Parth Mittal
rewrote misra/gries
parent ad01c805
streaming/streaming.tex
@@ -30,68 +30,73 @@ the occurrences of $j$ in $\alpha[1 \ldots m]$. Then the majority problem
is to find (if it exists) a $j$ such that $f_j > m / 2$.
We consider the more general frequent elements problem, where we want to find
$F_k = \{ j \mid f_j > m / k \}$. Suppose that we (magically) knew some small set
$C$ which contains $F_k$. Then we can pass over the input once, keeping track of
how many times we see each member of $C$, and then find $F_k$ easily.
The challenge is to find a small $C$, which is precisely what the Misra/Gries
Algorithm does.
$F_k = \{ j \mid f_j > m / k \}$. Suppose that we knew some small set
$C$ which contains $F_k$. Then, with a pass over the input, we can count the
occurrences of each element of $C$, and hence find $F_k$ in
$\O(\vert C \vert \log m)$ space.
\subsection{Misra/Gries Algorithm}
\subsection{The Misra/Gries Algorithm}
We will now see a deterministic one-pass algorithm that estimates the frequency
of each element in a stream of integers. We shall see that it also provides
us with a small set $C$ containing $F_k$, and hence lets us solve the frequent
elements problem efficiently.
TODO: Typeset the algorithm better.
\proc{FrequencyEstimate}$(\alpha, k)$
\algin the data stream $\alpha$, the target for the estimator $k$
\:Init: $A \= \emptyset$
\:For $j$ a number from the stream:
\:If $j$ is a key in $A$, $A[j] \= A[j] + 1$.
\:Else If $\vert A \vert < k - 1$, add the key $j$ to $A$ and set $A[j] \= 1$.
\:Else For each key $\ell$ in $A$, reduce $A[\ell] \= A[\ell] - 1$.
Delete $\ell$ from $A$ if $A[\ell] = 0$.
\:After processing the entire stream, return A.
\:\em{Init}: $A \= \emptyset$. (an empty map)
\:\em{Process}($x$):
\:If $x \in$~keys($A$), $A[x] \= A[x] + 1$.
\:Else If $\vert$keys($A$)$\vert < k - 1$, $A[x] \= 1$.
\:Else $\forall a \in$~keys($A$): $A[a] \= A[a] - 1$,
delete $a$ from $A$ if $A[a] = 0$.
\:\em{Output}: $\hat{f}_a = A[a]$ if $a \in$~keys($A$), and $\hat{f}_a = 0$ otherwise.
\endalgo
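The Init/Process/Output steps above can be sketched in Python as follows. This is only an illustrative sketch, not part of the notes: the names `misra_gries` and `estimate` are hypothetical, and the stream is assumed to be any iterable of hashable items.

```python
def misra_gries(stream, k):
    """Return the map A of frequency estimates; it never holds more than k - 1 keys."""
    A = {}  # Init: an empty map
    for x in stream:  # Process(x)
        if x in A:
            A[x] += 1
        elif len(A) < k - 1:
            A[x] = 1
        else:
            # Decrement every stored counter and delete those that reach zero.
            for a in list(A):
                A[a] -= 1
                if A[a] == 0:
                    del A[a]
    return A


def estimate(A, a):
    """Output: the estimate for a, which is 0 when a is not a key of A."""
    return A.get(a, 0)


A = misra_gries([1, 2, 1, 3, 1, 2, 1], k=3)
print(A)  # {1: 3, 2: 1}
```

On this stream the true frequencies are 4, 2 and 1 for the elements 1, 2 and 3, so every estimate is an underestimate by at most $m/k = 7/3$.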
Let us show that $A[j]$ is a good estimate for the frequency $f_j$.
Let us show that $\hat{f}_a$ is a good estimate for the frequency $f_a$.
\lemma{
$f_j - m / k \leq A[j] \leq f_j$
$f_a - m / k \leq \hat{f}_a \leq f_a$
}
\proof
Suppose that $A$ maintains the value for each key $j \in [n]$ (instead of
just $k - 1$ of them). We can recast \alg{FrequencyEstimate} in this setting:
We always increment $A[j]$ on seeing $j$ in the stream, but if there are
$\geq k$ positive values $A[\ell]$ after this step, we decrease each of them
by 1.
In particular, this reduces the value of the most recently added key $A[j]$
back to $0$.
Now, we see immediately that $A[j] \leq f_j$, since it is only incremented when
we see $j$ in the stream. To see the other inequality, consider the potential
function $\Phi = \sum_{\ell} A[\ell]$. Note that $\Phi$ increases by exactly
$m$ (since the stream contains $m$ elements), and is decreased by $k$ every
time $A[j]$ decreases. Since $\Phi = 0$ initially and $\Phi \geq 0$, we get
that $A[j]$ is decreased at most $m / k$ times.
We see immediately that $\hat{f}_a \leq f_a$, since the counter $A[a]$ is only
incremented when we see $a$ in the stream.
To see the other inequality, suppose that we have a counter for each
$a \in [n]$ (instead of just $k - 1$ keys at a time). Whenever we have at least
$k$ non-zero counters, we will decrease all of them by $1$; this gives exactly
the same estimate as the algorithm above.
Now consider the potential
function $\Phi = \sum_{a \in [n]} A[a]$. Note that $\Phi$ increases by
exactly $m$ in total (since $\alpha$ contains $m$ elements), and is decreased by $k$
in every round in which some $A[x]$ decreases. Since $\Phi = 0$ initially and
$\Phi \geq 0$ throughout, there are at most $m / k$ such rounds, so each $A[x]$
decreases at most $m / k$ times. Hence $\hat{f}_a \geq f_a - m / k$.
\qed
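The counting argument in the proof can be checked on a small input with the following sketch. It implements the proof's variant that keeps a counter for every element seen, and it counts the rounds of decrements. The name `estimate_with_all_counters` and the returned `rounds` value are hypothetical, introduced only for illustration.

```python
from collections import Counter


def estimate_with_all_counters(stream, k):
    """Proof-style variant: one counter per element, with no cap on the number of keys.

    Returns (counters, rounds), where rounds is the number of decrement rounds.
    """
    counters = Counter()
    rounds = 0
    for x in stream:
        counters[x] += 1
        positive = [a for a, c in counters.items() if c > 0]
        if len(positive) >= k:
            # Each round lowers the potential (the sum of all counters) by k.
            for a in positive:
                counters[a] -= 1
            rounds += 1
    return counters, rounds


stream = [1, 2, 1, 3, 1, 2, 1]
k = 3
counters, rounds = estimate_with_all_counters(stream, k)
true_freq = Counter(stream)
for a in true_freq:
    # The lemma: f_a - m/k <= estimate <= f_a.
    assert true_freq[a] - len(stream) / k <= counters[a] <= true_freq[a]
assert rounds * k <= len(stream)
```

Since at most $k - 1$ counters are positive before each increment, a round is triggered with exactly $k$ positive counters, which is why the potential drops by exactly $k$ each time.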
Now, for $j \in F_k$, we know that $f_j > m / k$, which implies that $A[j] > 0$.
Hence $F_k \subseteq C = \{ j \mid A[j] > 0 \}$, and we have a $C$ of size
$k - 1$ ready for the second pass over the input.
\theorem{
There exists a deterministic 2-pass algorithm that finds $F_k$ in
$\O(k(\log n + \log m))$ space.
}
\proof
The correctness of the algorithm follows from the discussion above; we show
the bound on the space used below.
In the first pass, we only need to store $k - 1$ key-value pairs for $A$
(for example, as an unordered list),
and the key and the value need $\lfloor\log_2 n \rfloor + 1$ and
$\lfloor \log_2 m \rfloor + 1$ bits respectively.
In the second pass, we have one key-value pair for each element of $C$, and
they take the same amount of space as above.
In the first pass, we obtain the frequency estimate $\hat{f}$ by the
Misra/Gries algorithm.
We set $C = \{ a \mid \hat{f}_a > 0 \}$. For $a \in F_k$, we have
$f_a > m / k$, and hence $\hat{f}_a > 0$ by the previous Lemma.
In the second pass, we count $f_c$ exactly for each $c \in C$, and hence know
$F_k$ at the end.
To see the bound on space used, note that
$\vert C \vert = \vert$keys($A$)$\vert \leq k - 1$, and a key-value pair can
be stored in $\O(\log n + \log m)$ bits.
\qed
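The two passes from the proof can be sketched as follows. This is a minimal sketch under the assumption that the stream can be replayed (here it is a list); the function name `frequent_elements` is hypothetical.

```python
def frequent_elements(stream, k):
    """Return F_k = {a : f_a > m / k} using two passes over a replayable stream."""
    # Pass 1: Misra/Gries estimates, keeping at most k - 1 keys.
    est = {}
    for x in stream:
        if x in est:
            est[x] += 1
        elif len(est) < k - 1:
            est[x] = 1
        else:
            for a in list(est):
                est[a] -= 1
                if est[a] == 0:
                    del est[a]

    # Pass 2: exact counts, but only for the candidates C = keys(est).
    exact = dict.fromkeys(est, 0)
    m = 0
    for x in stream:
        m += 1
        if x in exact:
            exact[x] += 1

    # Compare f_c * k > m in integers to avoid floating-point division.
    return {c for c, f in exact.items() if f * k > m}


print(frequent_elements([1, 2, 1, 3, 1, 2, 1], k=3))  # {1}
```

Both passes store at most $k - 1$ key-value pairs, which matches the space bound in the theorem.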
\endchapter