datovky / ds2-notes · Commit 16df9964
authored May 08, 2021 by Parth Mittal
wrote the BJKST algorithm, some edits
parent d445af83
streaming/streaming.tex
@@ -149,6 +149,14 @@ are all wrong is:
$$\Pr\left[\bigcap_i X_i > \varepsilon \cdot m\right] \leq 1/2^t \leq \delta$$
\qed
The main advantage of this algorithm is that its output on two different
streams (computed with the same set of hash functions $h_i$) is just the sum
of the respective tables $C$. It can also be extended to support events
which remove an occurrence of an element $x$ (with the caveat that upon
termination the ``frequency'' $f_x$ for each $x$ must be non-negative).
(TODO: perhaps make the second part an exercise?)
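For concreteness, here is a minimal Python sketch (ours, not part of the
notes) of both extensions, assuming the table $C$ has $t$ rows of $k$
counters with one 2-independent hash $h_i$ per row, in the style of the
sketch described above; the class and parameter names are hypothetical.

```python
import random

class CountTable:
    """t rows of k counters, one affine (2-independent) hash per row."""
    P = (1 << 61) - 1  # a large prime for the affine hashes

    def __init__(self, t, k, seed=0):
        rng = random.Random(seed)
        self.t, self.k = t, k
        self.hashes = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(t)]
        self.C = [[0] * k for _ in range(t)]

    def _h(self, i, x):
        a, b = self.hashes[i]
        return ((a * x + b) % self.P) % self.k

    def update(self, x, delta=1):
        # delta = +1 inserts x, delta = -1 removes one occurrence of x;
        # valid as long as every final frequency f_x stays non-negative.
        for i in range(self.t):
            self.C[i][self._h(i, x)] += delta

    def estimate(self, x):
        return min(self.C[i][self._h(i, x)] for i in range(self.t))

def merge(s1, s2):
    """Sum of tables = sketch of the concatenated streams,
    provided both sketches were built with the same hash functions."""
    assert s1.hashes == s2.hashes
    out = CountTable(s1.t, s1.k)
    out.hashes = s1.hashes
    out.C = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(s1.C, s2.C)]
    return out
```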
\section{Counting Distinct Elements}

We continue working with a stream $\alpha[1 \ldots m]$ of integers from $[n]$,
and define $f_a$ (the frequency of $a$) as before. Let
@@ -156,13 +164,16 @@ $d = \vert \{ j : f_j > 0 \} \vert$. Then the distinct elements problem is
to estimate $d$.
\subsection{The AMS Algorithm}

Suppose we map our universe $[n]$ to itself via a random permutation $\pi$.
Then if the number of distinct elements in a stream is $d$, we expect
$d/2^i$ of them to be divisible by $2^i$ after applying $\pi$. This is the
core idea of the following algorithm.
Define ${\tt tz}(x) := \max\{i \mid 2^i$~divides~$x\}$ (i.e. the number of
trailing zeroes in the base-2 representation of $x$).
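As an illustration (ours, not the notes'), the following snippet computes
${\tt tz}$ and checks the $d/2^i$ intuition empirically; the parameters
$n$ and $d$ below are arbitrary.

```python
import random

def tz(x):
    """Number of trailing zeroes in the binary representation of x (x > 0)."""
    i = 0
    while x % 2 == 0:
        x //= 2
        i += 1
    return i

# Among d distinct values pushed through a random permutation of [n],
# roughly d / 2**i of the images should have tz >= i.
n, d = 1 << 16, 1000
pi = random.sample(range(1, n + 1), n)  # a random permutation of [n]
elements = range(1, d + 1)              # d distinct stream elements
for i in range(1, 6):
    count = sum(1 for x in elements if tz(pi[x - 1]) >= i)
    print(f"i={i}: expected ~{d / 2**i:.0f}, observed {count}")
```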
\algo{DistinctElements} \algalias{AMS}
\algin the data stream $\alpha$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a
2-independent family.
\:: $z \= 0$.
@@ -227,4 +238,97 @@ the space used by a single estimator is $\O(\log n)$ since we can store $h$ in
$\O(\log n)$ bits, and $z$ in $\O(\log \log n)$ bits, and hence a
$(3, \delta)$-estimator uses $\O(\log(1/\delta) \cdot \log n)$ bits.
\subsection{The BJKST Algorithm}

We will now look at another algorithm for the distinct elements problem.
Note that unlike the AMS algorithm, it accepts an accuracy parameter
$\varepsilon$.
\algo{DistinctElements} \algalias{BJKST}
\algin the data stream $\alpha$, the accuracy $\varepsilon$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a
2-independent family.
\:: $z \= 0$, $B \= \emptyset$.
\:\em{Process}($x$):
\:: If ${\tt tz}(h(x)) \geq z$:
\::: $B \= B \cup \{(x, {\tt tz}(h(x)))\}$
\:: While $\vert B \vert \geq c/\varepsilon^2$:
\::: $z \= z + 1$.
\::: Remove all $(a, b)$ from $B$ such that $b = {\tt tz}(h(a)) < z$.
\algout $\hat{d} \= \vert B \vert \cdot 2^{z}$.
\endalgo
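For readers who want to experiment, here is a minimal executable sketch of
the same logic (ours, not the notes'). The affine hash modulo a Mersenne
prime stands in for the 2-independent family, and the default $c = 600$ is
one choice satisfying the requirement $c > 576$ derived in the analysis
below.

```python
import random

P = (1 << 61) - 1  # Mersenne prime; (a*x + b) % P gives a 2-independent family

def tz(x):
    """Trailing zeroes in binary; we cap tz(0) at 61 bits by convention."""
    if x == 0:
        return 61
    i = 0
    while x % 2 == 0:
        x //= 2
        i += 1
    return i

class BJKST:
    def __init__(self, eps, c=600, seed=None):
        rng = random.Random(seed)
        self.a, self.b = rng.randrange(1, P), rng.randrange(P)
        self.z = 0
        self.limit = int(c / eps ** 2)  # the threshold c / eps^2 on |B|
        self.B = {}                     # stores the pairs (x, tz(h(x)))

    def _h(self, x):
        return (self.a * x + self.b) % P

    def process(self, x):
        t = tz(self._h(x))
        if t >= self.z:
            self.B[x] = t
            while len(self.B) >= self.limit:
                self.z += 1
                self.B = {a: b for a, b in self.B.items() if b >= self.z}

    def estimate(self):
        return len(self.B) * 2 ** self.z

# Usage: sk = BJKST(eps=0.5); feed sk.process(x) for each x; read sk.estimate().
```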
\lemma{For any $\varepsilon > 0$, the BJKST algorithm is an
$(\varepsilon, \delta)$-estimator for some constant $\delta$.}
\proof
We set up the random variables $X_{r, j}$ and $Y_r$ as before. Let $t$ denote
the value of $z$ when the algorithm terminates; then $Y_t = \vert B \vert$,
and our estimate is $\hat{d} = \vert B \vert \cdot 2^t = Y_t \cdot 2^t$.

Note that if $t = 0$, the algorithm computes $d$ exactly (since we never
remove any elements from $B$, and $\hat{d} = \vert B \vert$). For $t \geq 1$,
we say that the algorithm \em{fails} iff
$\vert Y_t \cdot 2^t - d \vert > \varepsilon d$. Rearranging, we have that
the algorithm fails iff:
$$\left\vert Y_t - {d \over 2^t} \right\vert \geq {\varepsilon d \over 2^t}$$
To bound the probability of this event, we will sum over all possible values
$r \in [\log n]$ that $t$ can take. Note that for \em{small} values of $r$,
a failure is unlikely when $t = r$, since the required deviation $d/2^t$ is
large. For \em{large} values of $r$, simply achieving $t = r$ is difficult.
More formally, let $s$ be the unique integer such that:
$${12 \over \varepsilon^2} \leq {d \over 2^s} \leq {24 \over \varepsilon^2}$$
Then we have:
$$\Pr[{\rm fail}] = \sum_{r=1}^{\log n} \Pr\left[\left\vert Y_r - {d \over 2^r}\right\vert \geq {\varepsilon d \over 2^r} \land t = r\right]$$
After splitting the sum around $s$, we bound small and large values by different
methods as described above to get:
$$\Pr[{\rm fail}] \leq \sum_{r=1}^{s-1} \Pr\left[\left\vert Y_r - {d \over 2^r}\right\vert \geq {\varepsilon d \over 2^r}\right] + \sum_{r=s}^{\log n} \Pr\left[t = r\right]$$
Recall that $\E[Y_r] = d/2^r$, and also that $\Var[Y_r] \leq \E[Y_r] = d/2^r$
(by the pairwise independence of the $X_{r, j}$, as before), so the terms in
the first sum can be bounded using Chebyshev's inequality. The second sum is
equal to the probability of the event $[t \geq s]$, that is, the event
$Y_{s-1} \geq c/\varepsilon^2$ (since $z$ is only increased when $B$ becomes
larger than this threshold). We will simply use Markov's inequality to bound
this event. Putting it all together, we have:
$$\eqalign{
\Pr[{\rm fail}] &\leq \sum_{r=1}^{s-1} {\Var[Y_r] \over (\varepsilon d / 2^r)^2} + {\E[Y_{s-1}] \over c/\varepsilon^2}
\leq \sum_{r=1}^{s-1} {d/2^r \over (\varepsilon d / 2^r)^2} + {d/2^{s-1} \over c/\varepsilon^2} \cr
&= \sum_{r=1}^{s-1} {2^r \over \varepsilon^2 d} + {\varepsilon^2 d \over c \cdot 2^{s-1}}
\leq {2^s \over \varepsilon^2 d} + {\varepsilon^2 d \over c \cdot 2^{s-1}}
}$$
Recalling the definition of $s$, we have $2^s / d \leq \varepsilon^2 / 12$,
and $d / 2^{s-1} \leq 48 / \varepsilon^2$, and hence:
$$\Pr[{\rm fail}] \leq {1 \over 12} + {48 \over c}$$
which is smaller than (say) $1/6$ for $c > 576$. Hence the algorithm is an
$(\varepsilon, 1/6)$-estimator.
\qed
As before, we can run $\O(\log(1/\delta))$ independent copies of the
algorithm and take the median of their estimates to reduce the probability
of failure to $\delta$. The only thing remaining is to look at the space
usage of the algorithm.
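A sketch of this median trick, reusing the hypothetical `BJKST` class from
the snippet above; the constant 36 in the number of copies is purely
illustrative.

```python
import math
from statistics import median

def distinct_estimate(stream, eps, delta):
    """Median of O(log(1/delta)) independent BJKST estimates."""
    k = max(1, math.ceil(36 * math.log(1 / delta)))  # illustrative constant
    sketches = [BJKST(eps, seed=i) for i in range(k)]
    for x in stream:
        for sk in sketches:
            sk.process(x)
    return median(sk.estimate() for sk in sketches)
```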
The counter $z$ requires only $\O(\log \log n)$ bits, and $B$ has
$\O(1/\varepsilon^2)$ entries, each of which needs $\O(\log n)$ bits.
Finally, the hash function $h$ needs $\O(\log n)$ bits, so the total space
used is dominated by $B$, and the algorithm uses $\O(\log n / \varepsilon^2)$
space.
\endchapter