are all wrong is:
$$ \Pr\left[\bigcap_i X_i > \varepsilon \cdot m \right] \leq 1/2^t \leq \delta $$
The main advantage of this algorithm is that its output on two different
streams (computed with the same set of hash functions $h_i$) is just the sum
of the respective tables $C$. It can also be extended to support events
which remove an occurrence of an element $x$ (with the caveat that upon
termination the ``frequency'' $f_x$ for each $x$ must be non-negative).
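
The mergeability claim can be checked on a toy example. The following Python
sketch is only an illustration: the table layout (one row per hash function,
$k$ columns) and the names {\tt sketch} and {\tt merge\_tables} are assumptions
made here, and the hash functions are passed in as plain callables.

```python
def sketch(stream, hashes, k):
    """Build a counter table C with one row per hash function and k columns.

    The layout is an assumption of this example, not taken from the notes.
    """
    C = [[0] * k for _ in hashes]
    for x in stream:
        for i, h in enumerate(hashes):
            C[i][h(x) % k] += 1
    return C


def merge_tables(C1, C2):
    """Entrywise sum of two tables built with the same hash functions."""
    return [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(C1, C2)]
```

Since every update adds one to a fixed cell per row, the table of the
concatenated stream equals the entrywise sum of the two tables, regardless of
which hash functions are used (as long as both tables use the same ones).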
\section{Counting Distinct Elements}
We continue working with a stream $\alpha[1 \ldots m]$ of integers from $[n]$,
and define $f_a$ (the frequency of $a$) as before. Let
$d = \vert \{ j : f_j > 0 \} \vert$. Then the distinct elements problem is
to estimate $d$.
\subsection{The AMS Algorithm}
Suppose we map our universe $[n]$ to itself via a random permutation $\pi$.
Then if the number of distinct elements in a stream is $d$, we expect
$d / 2^i$ of them to be divisible by $2^i$ after applying $\pi$. This is the
core idea of the following algorithm.
Define ${\tt tz}(x) := \max\{ i \mid 2^i $~divides~$ x \}$
(i.e. the number of trailing zeroes in the base-2 representation of $x$).
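
As a small illustration of this definition, here is one possible Python
version of ${\tt tz}$. It is a sketch only; restricting the input to positive
integers is an assumption made here, since ${\tt tz}(0)$ is not finite under
the definition above.

```python
def tz(x: int) -> int:
    """Largest i such that 2**i divides x, for a positive integer x."""
    if x <= 0:
        raise ValueError("this sketch defines tz only for positive integers")
    # x & -x keeps only the lowest set bit of x; its position is tz(x).
    return (x & -x).bit_length() - 1
```

For instance, among the integers $1, \ldots, 1024$ exactly $1024 / 2^3 = 128$
satisfy ${\tt tz}(x) \geq 3$, which matches the $d / 2^i$ heuristic.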
\algo{DistinctElements} \algalias{AMS}
\algin the data stream $\alpha$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent family.
\:: $z \= 0$.
the space used by a single estimator is $\O(\log n)$ since we can store $h$ in
$\O(\log n)$ bits, and $z$ in $\O(\log \log n)$ bits, and hence a $(3, \delta)$
estimator uses $\O(\log (1/\delta) \cdot \log n)$ bits.
\subsection{The BJKST Algorithm}
We will now look at another algorithm for the distinct elements problem.
Note that unlike the AMS algorithm, it accepts an accuracy parameter
$\varepsilon$.
\algo{DistinctElements} \algalias{BJKST}
\algin the data stream $\alpha$, the accuracy $\varepsilon$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent family.
\:: $z \= 0$, $B \= \emptyset$.
\::If ${\tt tz}(h(x)) \geq z$:
\:::$B \= B \cup \{ (x, {\tt tz}(h(x))) \}$
\:::While $\vert B \vert \geq c/\varepsilon^2$:
\::::$z \= z + 1$.
\::::Remove all $(a, b)$ from $B$ such that $b = {\tt tz}(h(a)) < z$.
\algout $\hat{d} \= \vert B \vert \cdot 2^{z}$.
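
The pseudocode above can be sketched in Python roughly as follows. This is a
minimal illustration, not a reference implementation. Several details are
assumptions made here: the hash function is $h(x) = (ax + b) \bmod p$ with the
prime $p = 2^{61} - 1$ (so it maps into $[p]$ rather than $[n]$), ${\tt tz}(0)$
is capped at $61$, $B$ is stored as a dictionary from $x$ to ${\tt tz}(h(x))$,
and the default $c = 600$ is just one value above the bound on $c$ used in the
analysis.

```python
import random


def tz(x):
    """Trailing zeroes of x; tz(0) is capped at 61 (a convention of this sketch)."""
    return (x & -x).bit_length() - 1 if x else 61


def bjkst(stream, eps, c=600, seed=None):
    """Estimate the number of distinct integers in stream (sketch of BJKST)."""
    p = (1 << 61) - 1                  # a Mersenne prime; assumed hash range
    rng = random.Random(seed)
    a, b = rng.randrange(p), rng.randrange(p)
    z, B = 0, {}                       # B maps x to tz(h(x))
    for x in stream:
        t = tz((a * x + b) % p)
        if t >= z:
            B[x] = t
            # Shrink B until it is below the threshold c / eps^2.
            while len(B) >= c / eps ** 2:
                z += 1
                B = {k: v for k, v in B.items() if v >= z}
    return len(B) * 2 ** z
```

When the stream has fewer than $c / \varepsilon^2$ distinct elements, $z$ stays
at $0$ and the sketch returns the exact count, e.g.
{\tt bjkst([1, 2, 2, 3, 3, 3], eps=0.5)} returns $3$.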
For any $\varepsilon > 0$, the BJKST algorithm is an
$(\varepsilon, \delta)$-estimator for some constant $\delta$.
We set up the random variables $X_{r, j}$ and $Y_r$ as before. Let $t$ denote
the value of $z$ when the algorithm terminates; then $Y_t = \vert B \vert$,
and our estimate is $\hat{d} = \vert B \vert \cdot 2^t = Y_t \cdot 2^t$.
Note that if $t = 0$, the algorithm computes $d$ exactly (since we never remove
any elements from $B$, and $\hat{d} = \vert B \vert$). For $t \geq 1$, we
say that the algorithm \em{fails} iff
$\vert Y_t \cdot 2^t - d \vert \geq \varepsilon d$. Rearranging, we have that the
algorithm fails iff:
$$ \left\vert Y_t - {d \over 2^t} \right\vert \geq {\varepsilon d \over 2^t} $$
To bound the probability of this event, we will sum over all possible values
$r \in [\log n]$ that $t$ can take. Note that for \em{small} values of $r$,
a failure is unlikely when $t = r$, since the required deviation $\varepsilon d / 2^r$ is
large. For \em{large} values of $r$, simply achieving $t = r$ is difficult.
More formally, let $s$ be the unique integer such that:
$$ {12 \over \varepsilon^2} \leq {d \over 2^s} < {24 \over \varepsilon^2}$$
Then we have:
$$ \Pr[{\rm fail}] = \sum_{r = 1}^{\log n}
\Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r}
\land t = r \right] $$
After splitting the sum around $s$, we bound small and large values by different
methods as described above to get:
$$ \Pr[{\rm fail}] \leq \sum_{r = 1}^{s - 1}
\Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r} \right] +
\sum_{r = s}^{\log n}
\Pr\left[t = r \right] $$
Recall that $\E[Y_r] = d / 2^r$, so the terms in the first sum can be bounded
using Chebyshev's inequality. The second sum is equal to the probability of
the event $[t \geq s]$, that is, the event $Y_{s - 1} \geq c / \varepsilon^2$
(since $z$ is only increased when $\vert B \vert$ reaches this threshold).
We will simply use Markov's inequality to bound this event.
Putting it all together, we have:
$$ \eqalign{
\Pr[{\rm fail}] &\leq \sum_{r = 1}^{s - 1}
{\Var[Y_r] \over (\varepsilon d / 2^r)^2} + {\E[Y_{s - 1}] \over c / \varepsilon^2}
\leq \sum_{r = 1}^{s - 1}
{d / 2^r \over (\varepsilon d / 2^r)^2} + {d / 2^{s - 1} \over c / \varepsilon^2}\cr
&= \sum_{r = 1}^{s - 1} {2^r \over \varepsilon^2 d} + {\varepsilon^2 d \over c \cdot 2^{s - 1}}
\leq {2^{s} \over \varepsilon^2 d} + {\varepsilon^2 d \over c \cdot 2^{s - 1}}\cr
} $$
Recalling the definition of $s$, we have $2^s / d \leq \varepsilon^2 / 12$, and
$d / 2^{s - 1} \leq 48 / \varepsilon^2$, and hence:
$$ \Pr[{\rm fail}] \leq {1 \over 12} + {48 \over c} $$
which is smaller than (say) $1 / 6$ for $c > 576$. Hence the algorithm is an
$(\varepsilon, 1 / 6)$-estimator.
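
The arithmetic behind the choice of $c$ is easy to check with exact fractions.
The helper name {\tt failure\_bound} below is introduced only for this
illustration.

```python
from fractions import Fraction


def failure_bound(c):
    """The bound 1/12 + 48/c on the failure probability, as an exact fraction."""
    return Fraction(1, 12) + Fraction(48, c)
```

At $c = 576$ the bound is exactly $1/6$, so any larger $c$ pushes it strictly
below $1/6$; doubling to $c = 1152$ gives $1/8$.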
As before, we can run $\O(\log (1/\delta))$ independent copies of the algorithm,
and take the median of their estimates to reduce the probability of failure
to $\delta$. The only thing remaining is to look at the space usage of the
algorithm.
The counter $z$ requires only $\O(\log \log n)$ bits, and $B$ has
$\O(1 / \varepsilon^2)$ entries, each of which needs $\O( \log n )$ bits.
Finally, the hash function $h$ needs $\O(\log n)$ bits, so the total space
used is dominated by $B$, and the algorithm uses $\O(\log n / \varepsilon^2)$
bits.
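
The median trick mentioned above can be sketched as follows. Here
{\tt estimator} stands for any function that runs one independent copy of the
algorithm on the stream. Both that interface and the name
{\tt median\_of\_estimates} are assumptions of this illustration, and the
number of copies is left to the caller because the notes only specify it up to
a constant factor.

```python
import statistics


def median_of_estimates(estimator, stream, copies):
    """Run `copies` independent estimators on the stream and return the median."""
    items = list(stream)  # materialise once so every copy sees the same stream
    return statistics.median(estimator(items) for _ in range(copies))
```

The median is off only if at least half of the copies fail, so a few wild
outliers among the individual estimates do not affect the output.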
