% Commit 16df9964 by Parth Mittal: ``wrote the BJKST algorithm, some edits''
% (parent d445af83; 1 changed file: streaming/streaming.tex, +107 −3)
are all wrong is:
$$\Pr\left[\bigcap_i X_i > \varepsilon \cdot m\right] \leq 1/2^t \leq \delta$$
\qed
The main advantage of this algorithm is that its output on two different
streams (computed with the same set of hash functions $h_i$) is just the sum
of the respective tables $C$. It can also be extended to support events
which remove an occurrence of an element $x$ (with the caveat that upon
termination the ``frequency'' $f_x$ for each $x$ must be non-negative).
(TODO: perhaps make the second part an exercise?)
\section{Counting Distinct Elements}
We continue working with a stream $\alpha[1 \ldots m]$ of integers from $[n]$,
and define $f_a$ (the frequency of $a$) as before. Let
$d = \vert \{ j : f_j > 0 \} \vert$. Then the distinct elements problem is
to estimate $d$.
\subsection{The AMS Algorithm}
Suppose we map our universe $[n]$ to itself via a random permutation $\pi$.
Then if the number of distinct elements in a stream is $d$, we expect $d/2^i$
of them to be divisible by $2^i$ after applying $\pi$. This is the
core idea of the following algorithm.
Define ${\tt tz}(x) := \max\{ i \mid 2^i$~divides~$x \}$ (i.e. the number of
trailing zeroes in the base-2 representation of $x$).
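As a quick sanity check, ${\tt tz}$ is cheap to compute with bit tricks; a
minimal Python sketch (the function name {\tt tz} simply mirrors the notation
above):

```python
def tz(x: int) -> int:
    """Number of trailing zeroes in the base-2 representation of x (x > 0)."""
    return (x & -x).bit_length() - 1  # x & -x isolates the lowest set bit

# examples: tz(8) = 3 (0b1000), tz(12) = 2 (0b1100), tz(7) = 0 (0b111)
```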
\algo{DistinctElements} \algalias{AMS}
\algin the data stream $\alpha$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent
family.
\::$z \= 0$.
the space used by a single estimator is $\O(\log n)$ since we can store $h$ in
$\O(\log n)$ bits, and $z$ in $\O(\log \log n)$ bits, and hence a
$(3, \delta)$-estimator uses $\O(\log(1/\delta) \cdot \log n)$ bits.
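For concreteness, a single AMS estimator can be sketched in Python as below.
This excerpt omits the Process step, so the sketch assumes the standard
presentation: $z$ tracks the maximum ${\tt tz}(h(x))$ seen, the output is
$2^{z + 1/2}$, and the 2-independent family is $h(x) = (ax + b) \bmod p$ for a
prime $p$ (all of these are assumptions, not taken from the excerpt):

```python
import random

def ams_estimate(stream):
    """One AMS distinct-elements estimator: z tracks the maximum number of
    trailing zeroes among hash values; the estimate is 2^(z + 1/2)."""
    p = (1 << 61) - 1                        # prime modulus, assumed >= n
    a = random.randrange(1, p)
    b = random.randrange(p)
    z = 0
    for x in stream:
        h = (a * x + b) % p                  # 2-independent hash of x
        if h:                                # tz(0) is undefined; skip it
            z = max(z, (h & -h).bit_length() - 1)
    return 2 ** (z + 0.5)
```

Since the output is a power of two (times $\sqrt 2$), a single run can only be
a constant-factor estimate, which is why the accuracy parameter is fixed at $3$.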
\subsection{The BJKST Algorithm}
We will now look at another algorithm for the distinct elements problem.
Note that unlike the AMS algorithm, it accepts an accuracy parameter
$\varepsilon$.
\algo{DistinctElements} \algalias{BJKST}
\algin the data stream $\alpha$, the accuracy $\varepsilon$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent
family.
\::$z \= 0$, $B \= \emptyset$.
\:\em{Process}($x$):
\::If ${\tt tz}(h(x)) \geq z$:
\:::$B \= B \cup \{ (x, {\tt tz}(h(x))) \}$
\:::While $\vert B \vert \geq c/\varepsilon^2$:
\::::$z \= z + 1$.
\::::Remove all $(a, b)$ from $B$ such that $b = {\tt tz}(h(a)) < z$.
\algout $\hat{d} \= \vert B \vert \cdot 2^{z}$.
\endalgo
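A direct transcription into Python. This is a sketch: the hash family
$h(x) = (ax + b) \bmod p$ and the choice $c = 576$ (the constant from the
analysis below) are assumptions, and $B$ is stored as a dictionary mapping
$x$ to ${\tt tz}(h(x))$:

```python
import random

def bjkst(stream, eps, c=576):
    """One run of the BJKST sketch; returns the estimate |B| * 2^z."""
    p = (1 << 61) - 1                          # prime modulus, assumed >= n
    a = random.randrange(1, p)
    b = random.randrange(p)
    tz = lambda v: (v & -v).bit_length() - 1   # trailing zeroes of v > 0
    z, B = 0, {}                               # B maps x -> tz(h(x))
    for x in stream:
        t = tz((a * x + b) % p or p)           # avoid tz(0) if h(x) = 0
        if t >= z:
            B[x] = t
            while len(B) >= c / eps**2:        # B too big: raise z, shrink B
                z += 1
                B = {y: ty for y, ty in B.items() if ty >= z}
    return len(B) * 2 ** z
```

On a stream with fewer than $c/\varepsilon^2$ distinct elements the threshold
is never reached, so $z$ stays $0$ and the output is exactly $d$.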
\lemma{For any $\varepsilon > 0$, the BJKST algorithm is an
$(\varepsilon, \delta)$-estimator for some constant $\delta$.}
\proof
We set up the random variables $X_{r, j}$ and $Y_r$ as before. Let $t$ denote
the value of $z$ when the algorithm terminates, then $Y_t = \vert B \vert$,
and our estimate $\hat{d} = \vert B \vert \cdot 2^t = Y_t \cdot 2^t$.
Note that if $t = 0$, the algorithm computes $d$ exactly (since we never remove
any elements from $B$, and $\hat{d} = \vert B \vert$). For $t \geq 1$, we
say that the algorithm \em{fails} iff
$\vert Y_t \cdot 2^t - d \vert \geq \varepsilon d$. Rearranging, we have that
the algorithm fails iff:
$$\left\vert Y_t - {d \over 2^t} \right\vert \geq {\varepsilon d \over 2^t}$$
To bound the probability of this event, we will sum over all possible values
$r \in [\log n]$ that $t$ can take. Note that for \em{small} values of $r$,
a failure is unlikely when $t = r$, since the required deviation $d/2^t$ is
large. For \em{large} values of $r$, simply achieving $t = r$ is difficult.
More formally, let $s$ be the unique integer such that:
$${12 \over \varepsilon^2} \leq {d \over 2^s} \leq {24 \over \varepsilon^2}$$
Then we have:
$$\Pr[{\rm fail}] = \sum_{r=1}^{\log n} \Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r} \land t = r \right]$$
After splitting the sum around $s$, we bound small and large values by different
methods as described above to get:
$$\Pr[{\rm fail}] \leq \sum_{r=1}^{s-1} \Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r} \right] + \sum_{r=s}^{\log n} \Pr\left[ t = r \right]$$
Recall that $\E[Y_r] = d/2^r$ and (as in the analysis of the AMS algorithm)
$\Var[Y_r] \leq \E[Y_r]$ by the 2-independence of $h$, so the terms in the
first sum can be bounded using Chebyshev's inequality. The second sum is equal
to the probability of the event $[t \geq s]$, that is, the event
$Y_{s-1} \geq c/\varepsilon^2$ (since $z$ is only increased when $B$ becomes
larger than this threshold). We will simply use Markov's inequality to bound
this event. Putting it all together, we have:
$$\eqalign{
\Pr[{\rm fail}] &\leq \sum_{r=1}^{s-1} {\Var[Y_r] \over (\varepsilon d / 2^r)^2} + {\E[Y_{s-1}] \over c/\varepsilon^2}
\leq \sum_{r=1}^{s-1} {d/2^r \over (\varepsilon d / 2^r)^2} + {d/2^{s-1} \over c/\varepsilon^2} \cr
&= \sum_{r=1}^{s-1} {2^r \over \varepsilon^2 d} + {\varepsilon^2 d \over c \cdot 2^{s-1}}
\leq {2^{s} \over \varepsilon^2 d} + {\varepsilon^2 d \over c \cdot 2^{s-1}}
}$$
Recalling the definition of $s$, we have $2^s/d \leq \varepsilon^2/12$, and
$d/2^{s-1} \leq 48/\varepsilon^2$, and hence:
$$\Pr[{\rm fail}] \leq {1 \over 12} + {48 \over c}$$
which is smaller than (say) $1/6$ for $c > 576$. Hence the algorithm is an
$(\varepsilon, 1/6)$-estimator.
\qed
As before, we can run $\O(\log(1/\delta))$ independent copies of the algorithm,
and take the median of their estimates to reduce the probability of failure
to $\delta$. The only thing remaining is to look at the space usage of the
algorithm.
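The median trick is generic, so it can be sketched independently of the
estimator. The copy count $\lceil 12 \ln(1/\delta) \rceil$ (rounded up to an
odd number) is an illustrative choice: any odd $\Theta(\log(1/\delta))$ works,
with the constant set by the Chernoff bound.

```python
import math
import statistics

def median_of_copies(estimator, delta):
    """Boost a constant-failure-probability randomized estimator to failure
    probability delta by returning the median of independent copies."""
    k = math.ceil(12 * math.log(1 / delta)) | 1   # odd number of copies
    return statistics.median(estimator() for _ in range(k))

# usage (with some randomized estimator, e.g. a single BJKST run):
#   median_of_copies(single_run, delta=0.01)      # single_run is hypothetical
```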
The counter $z$ requires only $\O(\log \log n)$ bits, and $B$ has
$\O(1/\varepsilon^2)$ entries, each of which needs $\O(\log n)$ bits.
Finally, the hash function $h$ needs $\O(\log n)$ bits, so the total space
used is dominated by $B$, and the algorithm uses $\O(\log n / \varepsilon^2)$
space.
\endchapter