Commit 16df9964 authored by Parth Mittal

wrote the BJKST algorithm, some edits

parent d445af83
@@ -149,6 +149,14 @@ are all wrong is:
$$ \Pr\left[\bigcap_i \left( X_i > \varepsilon \cdot m \right) \right] \leq 1/2^t \leq \delta $$
\qed
The main advantage of this algorithm is that its table $C$ for the concatenation
of two streams (computed with the same set of hash functions $h_i$) is just the sum
of the tables computed for the two streams separately. It can also be extended to
support events which remove an occurrence of an element $x$ (with the caveat that upon
termination the ``frequency'' $f_x$ for each $x$ must be non-negative).
(TODO: perhaps make the second part an exercise?).
\section{Counting Distinct Elements}
We continue working with a stream $\alpha[1 \ldots m]$ of integers from $[n]$,
and define $f_a$ (the frequency of $a$) as before. Let
@@ -156,13 +164,16 @@ $d = \vert \{ j : f_j > 0 \} \vert$. Then the distinct elements problem is
to estimate $d$.
\subsection{The AMS Algorithm}
Suppose we map our universe $[n]$ to itself via a random permutation $\pi$.
Then if the number of distinct elements in a stream is $d$, we expect
$d / 2^i$ of them to be divisible by $2^i$ after applying $\pi$. This is the
core idea of the following algorithm.
Define ${\tt tz}(x) := \max\{ i \mid 2^i $~divides~$ x \}$
(i.e. the number of trailing zeroes in the base-2 representation of $x$).
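As a quick sanity check of this definition:
$$ {\tt tz}(12) = 2, \quad {\tt tz}(8) = 3, \quad {\tt tz}(7) = 0, $$
since $12 = 1100_2$, $8 = 1000_2$ and $7 = 111_2$.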
\algo{DistinctElements} \algalias{AMS}
\algin the data stream $\alpha$, the accuracy $\varepsilon$,
the error parameter $\delta$.
\algin the data stream $\alpha$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent
family.
\:: $z \= 0$.
@@ -227,4 +238,97 @@ the space used by a single estimator is $\O(\log n)$ since we can store $h$ in
$\O(\log n)$ bits, and $z$ in $\O(\log \log n)$ bits, and hence a $(3, \delta)$
estimator uses $\O(\log (1/\delta) \cdot \log n)$ bits.
\subsection{The BJKST Algorithm}
We will now look at another algorithm for the distinct elements problem.
Note that unlike the AMS algorithm, it accepts an accuracy parameter
$\varepsilon$.
\algo{DistinctElements} \algalias{BJKST}
\algin the data stream $\alpha$, the accuracy $\varepsilon$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent
family.
\:: $z \= 0$, $B \= \emptyset$.
\:\em{Process}($x$):
\::If ${\tt tz}(h(x)) \geq z$:
\:::$B \= B \cup \{ (x, {\tt tz}(h(x))) \}$
\:::While $\vert B \vert \geq c/\varepsilon^2$:
\::::$z \= z + 1$.
\::::Remove all $(a, b)$ from $B$ such that $b = {\tt tz}(h(a)) < z$.
\algout $\hat{d} \= \vert B \vert \cdot 2^{z}$.
\endalgo
\lemma{
For any $\varepsilon > 0$, the BJKST algorithm is an
$(\varepsilon, \delta)$-estimator for some constant $\delta$.
}
\proof
We set up the random variables $X_{r, j}$ and $Y_r$ as before. Let $t$ denote
the value of $z$ when the algorithm terminates. Then $Y_t = \vert B \vert$,
and our estimate is $\hat{d} = \vert B \vert \cdot 2^t = Y_t \cdot 2^t$.
Note that if $t = 0$, the algorithm computes $d$ exactly (since we never remove
any elements from $B$, and $\hat{d} = \vert B \vert$). For $t \geq 1$, we
say that the algorithm \em{fails} iff
$\vert Y_t \cdot 2^t - d \vert \geq \varepsilon d$. Rearranging, we have that the
algorithm fails iff:
$$ \left\vert Y_t - {d \over 2^t} \right\vert \geq {\varepsilon d \over 2^t} $$
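To see the rearrangement, factor $2^t$ out of the absolute value:
$$ \left\vert Y_t \cdot 2^t - d \right\vert = 2^t \cdot \left\vert Y_t - {d \over 2^t} \right\vert, $$
so dividing the failure condition by $2^t > 0$ compares
$\vert Y_t - d / 2^t \vert$ against $\varepsilon d / 2^t$.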
To bound the probability of this event, we will sum over all possible values
$r \in [\log n]$ that $t$ can take. Note that for \em{small} values of $r$,
a failure is unlikely when $t = r$, since the required deviation $\varepsilon d / 2^r$ is
large. For \em{large} values of $r$, simply achieving $t = r$ is difficult.
More formally, let $s$ be the unique integer such that:
$$ {12 \over \varepsilon^2} \leq {d \over 2^s} < {24 \over \varepsilon^2}$$
Then we have:
$$ \Pr[{\rm fail}] = \sum_{r = 1}^{\log n}
\Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r}
\land t = r \right] $$
After splitting the sum around $s$, we bound small and large values by different
methods as described above to get:
$$ \Pr[{\rm fail}] \leq \sum_{r = 1}^{s - 1}
\Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r} \right] +
\sum_{r = s}^{\log n}
\Pr\left[t = r \right] $$
Recall that $\E[Y_r] = d / 2^r$, so the terms in the first sum can be bounded
using Chebyshev's inequality. The second sum is equal to the probability of
the event $[t \geq s]$, that is, the event $Y_{s - 1} \geq c / \varepsilon^2$
(since $z$ is only increased when $\vert B \vert$ reaches this threshold).
We will simply use Markov's inequality to bound this event.
Putting it all together, we have:
$$\eqalign{
\Pr[{\rm fail}] &\leq \sum_{r = 1}^{s - 1}
{\Var[Y_r] \over (\varepsilon d / 2^r)^2} + {\E[Y_{s - 1}] \over c / \varepsilon^2}
\leq \sum_{r = 1}^{s - 1}
{d / 2^r \over (\varepsilon d / 2^r)^2} + {d / 2^{s - 1} \over c / \varepsilon^2}\cr
&= \sum_{r = 1}^{s - 1} {2^r \over \varepsilon^2 d} + {\varepsilon^2 d \over c2^{s - 1}}
\leq {2^{s} \over \varepsilon^2 d} + {\varepsilon^2 d \over c2^{s - 1}}
}
$$
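The steps above can be checked as follows. This is a sketch which assumes, as in
the AMS analysis, that $Y_r = \sum_j X_{r, j}$ is a sum of pairwise independent
indicator variables, so that $\Var[Y_r] \leq \E[Y_r] = d / 2^r$. The first sum
is then geometric:
$$ \sum_{r = 1}^{s - 1} {2^r \over \varepsilon^2 d}
= {2^s - 2 \over \varepsilon^2 d} \leq {2^s \over \varepsilon^2 d} $$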
Recalling the definition of $s$, we have $2^s / d \leq \varepsilon^2 / 12$, and
$d / 2^{s - 1} \leq 48 / \varepsilon^2$, and hence:
$$ \Pr[{\rm fail}] \leq {1 \over 12} + {48 \over c} $$
which is smaller than (say) $1 / 6$ for $c > 576$. Hence the algorithm is an
$(\varepsilon, 1 / 6)$-estimator.
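Spelling out the last step: inverting the lower bound
$12 / \varepsilon^2 \leq d / 2^s$ gives $2^s / d \leq \varepsilon^2 / 12$, and
doubling the upper bound on $d / 2^s$ gives $d / 2^{s - 1} \leq 48 / \varepsilon^2$.
Substituting both:
$$ {2^{s} \over \varepsilon^2 d} + {\varepsilon^2 d \over c2^{s - 1}}
\leq {1 \over \varepsilon^2} \cdot {\varepsilon^2 \over 12}
+ {\varepsilon^2 \over c} \cdot {48 \over \varepsilon^2}
= {1 \over 12} + {48 \over c} $$
and $48 / c < 1 / 12$ exactly when $c > 576$.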
\qed
As before, we can run $\O(\log (1/\delta))$ independent copies of the algorithm,
and take the median of their estimates to reduce the probability of failure
to $\delta$. The only thing remaining is to look at the space usage of the
algorithm.
The counter $z$ requires only $\O(\log \log n)$ bits, and $B$ has
$\O(1 / \varepsilon^2)$ entries, each of which needs $\O( \log n )$ bits.
Finally, the hash function $h$ needs $\O(\log n)$ bits, so the total space
used is dominated by $B$, and the algorithm uses $\O(\log n / \varepsilon^2)$
space.
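To get a feel for the constants, here is a rough illustration. The values
$\varepsilon = 0.1$, $c = 577$ and $n = 2^{32}$ are assumptions chosen only
for this example. The threshold on $\vert B \vert$ is then
$$ {c \over \varepsilon^2} = 577 \cdot 100 = 57\,700 $$
entries. Each entry $(x, {\tt tz}(h(x)))$ takes $32$ bits for $x$ and about
$6$ bits for the number of trailing zeroes, so $B$ occupies roughly
$57\,700 \cdot 38 \approx 2.2 \cdot 10^6$ bits, while $z$ and $h$ contribute
only a few dozen bits more.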
\endchapter