% Commit 16df9964 (datovky / ds2notes), authored May 08, 2021 by Parth Mittal
% Commit message: wrote the BJKST algorithm, some edits
% Parent: d445af83
% Changed file: streaming/streaming.tex
...
...
@@ -149,6 +149,14 @@ are all wrong is:
$$ \Pr\left[ \bigcap_i X_i > \varepsilon \cdot m \right] \leq 1/2^t \leq \delta $$
\qed
The main advantage of this algorithm is that the table for the concatenation
of two streams (computed with the same set of hash functions $h_i$) is just the
sum of the respective tables $C$. It can also be extended to support events
which remove an occurrence of an element $x$ (with the caveat that upon
termination the ``frequency'' $f_x$ for each $x$ must be non-negative).
(TODO: perhaps make the second part an exercise?).
\section{Counting Distinct Elements}

We continue working with a stream $\alpha[1 \ldots m]$ of integers from $[n]$,
and define $f_a$ (the frequency of $a$) as before. Let
...
...
@@ -156,13 +164,16 @@ $d = \vert \{ j : f_j > 0 \} \vert$. Then the distinct elements problem is
to estimate $d$.
\subsection{The AMS Algorithm}
Suppose we map our universe $[n]$ to itself via a random permutation $\pi$.
Then if the number of distinct elements in a stream is $d$, we expect
$d / 2^i$ of them to be divisible by $2^i$ after applying $\pi$. This is the
core idea of the following algorithm.
Define ${\tt tz}(x) := \max\{ i \mid 2^i$~divides~$x \}$
(i.e. the number of trailing zeroes in the base-2 representation of $x$).
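As an illustration of this definition, here is a minimal Python sketch of ${\tt tz}$. The function name and the choice to reject non-positive inputs are assumptions made for this example (the definition above has no maximum for $x = 0$); it is not taken from the notes.

```python
def tz(x: int) -> int:
    """Number of trailing zero bits of x, i.e. the largest i with 2**i dividing x.

    Assumption for this sketch: x must be a positive integer, because every
    power of two divides 0 and the maximum in the definition would not exist.
    """
    if x <= 0:
        raise ValueError("tz is only defined here for positive integers")
    # x & -x isolates the lowest set bit; its bit length minus one is its index.
    return (x & -x).bit_length() - 1


if __name__ == "__main__":
    for value in (1, 6, 8, 12, 40):
        print(value, bin(value), tz(value))
```

The bit trick relies on Python integers behaving like two's complement numbers under `&`; a plain loop dividing by two would work equally well.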
\algo{DistinctElements} \algalias{AMS}
\algin the data stream $\alpha$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent
family.
\::$z \= 0$.
...
...
@@ -227,4 +238,97 @@ the space used by a single estimator is $\O(\log n)$ since we can store $h$ in
$\O(\log n)$ bits, and $z$ in $\O(\log \log n)$ bits, and hence a
$(3, \delta)$-estimator uses $\O(\log (1/\delta) \cdot \log n)$ bits.
\subsection{The BJKST Algorithm}

We will now look at another algorithm for the distinct elements problem.
Note that unlike the AMS algorithm, it accepts an accuracy parameter
$\varepsilon$.
\algo{DistinctElements} \algalias{BJKST}
\algin the data stream $\alpha$, the accuracy $\varepsilon$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent
family.
\::$z \= 0$, $B \= \emptyset$.
\:\em{Process}($x$):
\::If ${\tt tz}(h(x)) \geq z$:
\:::$B \= B \cup \{ (x, {\tt tz}(h(x))) \}$
\:::While $\vert B \vert \geq c/\varepsilon^2$:
\::::$z \= z + 1$.
\::::Remove all $(a, b)$ from $B$ such that $b = {\tt tz}(h(a)) < z$.
\algout $\hat{d} \= \vert B \vert \cdot 2^{z}$.
\endalgo
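The pseudocode above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the notes' reference implementation: the class name `BJKST`, the default constant `c = 600`, the `seed` parameter, and the hash $h(x) = ((ax + b) \bmod p) \bmod n + 1$ with the Mersenne prime $p = 2^{61} - 1$ are all choices made for this example. That hash is only approximately 2-independent after the reduction modulo $n$, and it maps into $\{1, \ldots, n\}$ so that ${\tt tz}$ is always defined.

```python
import random


def tz(x: int) -> int:
    """Trailing zero bits of a positive integer x."""
    return (x & -x).bit_length() - 1


class BJKST:
    """Sketch of the BJKST distinct-elements estimator (illustrative only)."""

    P = (1 << 61) - 1  # Mersenne prime used by the assumed hash family

    def __init__(self, n: int, eps: float, c: float = 600, seed=None):
        rng = random.Random(seed)
        self.n = n
        self.a = rng.randrange(1, self.P)  # random multiplier, nonzero
        self.b = rng.randrange(0, self.P)  # random offset
        self.threshold = c / eps ** 2      # capacity bound c / eps^2 for B
        self.z = 0
        self.B = {}                        # maps x -> tz(h(x)); plays the role of B

    def h(self, x: int) -> int:
        # Assumed hash into {1, ..., n}; approximately pairwise independent.
        return ((self.a * x + self.b) % self.P) % self.n + 1

    def process(self, x: int) -> None:
        t = tz(self.h(x))
        if t >= self.z:
            self.B[x] = t
            while len(self.B) >= self.threshold:
                self.z += 1
                # Keep only pairs (a, b) with b >= z, i.e. remove those with b < z.
                self.B = {a: b for a, b in self.B.items() if b >= self.z}

    def estimate(self) -> int:
        return len(self.B) * 2 ** self.z


if __name__ == "__main__":
    sketch = BJKST(n=10**6, eps=0.5, seed=1)
    for x in list(range(1, 101)) * 3:  # 100 distinct values, each seen 3 times
        sketch.process(x)
    print(sketch.estimate())           # exact here, since z never leaves 0
```

Storing $B$ as a dictionary keyed by $x$ gives the set semantics of the pseudocode for free: reprocessing an element that is already stored overwrites the same pair, and an element that was pruned earlier fails the test ${\tt tz}(h(x)) \geq z$ and is not reinserted.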
\lemma{
For any $\varepsilon > 0$, the BJKST algorithm is an
$(\varepsilon, \delta)$-estimator for some constant $\delta$.
}
\proof
We set up the random variables $X_{r, j}$ and $Y_r$ as before. Let $t$ denote
the value of $z$ when the algorithm terminates; then $Y_t = \vert B \vert$,
and our estimate is $\hat{d} = \vert B \vert \cdot 2^t = Y_t \cdot 2^t$.
Note that if $t = 0$, the algorithm computes $d$ exactly (since we never remove
any elements from $B$, and $\hat{d} = \vert B \vert$). For $t \geq 1$, we
say that the algorithm \em{fails} iff
$\vert Y_t \cdot 2^t - d \vert \geq \varepsilon d$. Dividing both sides by
$2^t$, we have that the algorithm fails iff:
$$ \left\vert Y_t - {d \over 2^t} \right\vert \geq {\varepsilon d \over 2^t} $$
To bound the probability of this event, we will sum over all possible values
$r \in [\log n]$ that $t$ can take. Note that for \em{small} values of $r$,
a failure is unlikely when $t = r$, since the required deviation
$\varepsilon d / 2^r$ is large. For \em{large} values of $r$, simply achieving
$t = r$ is difficult.
More formally, let $s$ be the unique integer such that:
$$ {12 \over \varepsilon^2} \leq {d \over 2^s} < {24 \over \varepsilon^2} $$
Then we have:
$$ \Pr[{\rm fail}] = \sum_{r = 1}^{\log n} \Pr\left[
	\left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r}
	\land t = r \right] $$
After splitting the sum around $s$, we bound small and large values by different
methods as described above to get:
$$ \Pr[{\rm fail}] \leq \sum_{r = 1}^{s - 1} \Pr\left[
	\left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r}
	\right] + \sum_{r = s}^{\log n} \Pr\left[ t = r \right] $$
Recall that $\E[Y_r] = d / 2^r$, so the terms in the first sum can be bounded
using Chebyshev's inequality. The second sum is equal to the probability of
the event $[t \geq s]$, that is, the event $Y_{s - 1} \geq c / \varepsilon^2$
(since $z$ is only increased when the size of $B$ reaches this threshold).
We will simply use Markov's inequality to bound the probability of this event.
Putting it all together, we have:
$$\eqalign{
\Pr[{\rm fail}] &\leq \sum_{r = 1}^{s - 1}
	{\Var[Y_r] \over (\varepsilon d / 2^r)^2}
	+ {\E[Y_{s - 1}] \over c / \varepsilon^2}
	\leq \sum_{r = 1}^{s - 1}
	{d / 2^r \over (\varepsilon d / 2^r)^2}
	+ {d / 2^{s - 1} \over c / \varepsilon^2} \cr
&= \sum_{r = 1}^{s - 1} {2^r \over \varepsilon^2 d}
	+ {\varepsilon^2 d \over c 2^{s - 1}}
	\leq {2^{s} \over \varepsilon^2 d}
	+ {\varepsilon^2 d \over c 2^{s - 1}}
}$$
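For the reader's convenience, the final inequality in the display above uses only the geometric sum $\sum_{r = 1}^{s - 1} 2^r = 2^s - 2$ (an empty sum when $s = 1$). Spelled out as a short check:
$$ \sum_{r = 1}^{s - 1} {2^r \over \varepsilon^2 d}
	= {2^s - 2 \over \varepsilon^2 d}
	\leq {2^s \over \varepsilon^2 d} $$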
Recalling the definition of $s$, we have $2^s / d \leq \varepsilon^2 / 12$, and
$d / 2^{s - 1} \leq 48 / \varepsilon^2$, and hence:
$$ \Pr[{\rm fail}] \leq {1 \over 12} + {48 \over c} $$
which is smaller than (say) $1 / 6$ for $c > 576$. Hence the algorithm is an
$(\varepsilon, 1 / 6)$-estimator.
\qed
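As a quick arithmetic check of the constant in the proof, the threshold $576$ comes from requiring the second term to be below $1/12$:
$$ {1 \over 12} + {48 \over c} < {1 \over 6}
	\iff {48 \over c} < {1 \over 12}
	\iff c > 576 $$
For instance, with the illustrative choice $c = 600$ (an assumption made only for this example; any $c > 576$ works), the bound evaluates to
$$ {1 \over 12} + {48 \over 600} = {25 \over 300} + {24 \over 300}
	= {49 \over 300} < {50 \over 300} = {1 \over 6} $$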
As before, we can run $\O(\log (1/\delta))$ independent copies of the algorithm,
and take the median of their estimates to reduce the probability of failure
to $\delta$. The only thing remaining is to look at the space usage of the
algorithm.
The counter $z$ requires only $\O(\log \log n)$ bits, and $B$ has
$\O(1 / \varepsilon^2)$ entries, each of which needs $\O(\log n)$ bits.
Finally, the hash function $h$ needs $\O(\log n)$ bits, so the total space
used is dominated by $B$, and the algorithm uses
$\O(\log n / \varepsilon^2)$ bits of space.
\endchapter