% Commit 16df9964 by Parth Mittal: ``wrote the BJKST algorithm, some edits''
% (parent d445af83; 1 changed file: streaming/streaming.tex, +107 −3)
are all wrong is:
$$\Pr\left[\bigcap_i X_i > \varepsilon \cdot m\right] \leq 1/2^t \leq \delta$$
\qed
The main advantage of this algorithm is that its output on two different
streams (computed with the same set of hash functions $h_i$) is just the sum
of the respective tables $C$. It can also be extended to support events
which remove an occurrence of an element $x$ (with the caveat that upon
termination the ``frequency'' $f_x$ for each $x$ must be non-negative).
(TODO: perhaps make the second part an exercise?)
\section{Counting Distinct Elements}
We continue working with a stream $\alpha[1 \ldots m]$ of integers from $[n]$,
and define $f_a$ (the frequency of $a$) as before. Let
$d = \vert \{ j : f_j > 0 \} \vert$. Then the distinct elements problem is
to estimate $d$.
\subsection{The AMS Algorithm}
Suppose we map our universe $[n]$ to itself via a random permutation $\pi$.
Then if the number of distinct elements in a stream is $d$, we expect $d/2^i$
of them to be divisible by $2^i$ after applying $\pi$. This is the
core idea of the following algorithm.
Define ${\tt tz}(x) := \max\{ i \mid 2^i$~divides~$x \}$ (i.e. the number of
trailing zeroes in the base-2 representation of $x$).
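As a quick sanity check, ${\tt tz}$ is cheap to compute with bit tricks; a
minimal Python sketch (the function name {\tt tz} simply mirrors the notation
above):

```python
def tz(x: int) -> int:
    """Number of trailing zeroes in the base-2 representation of x (x > 0)."""
    return (x & -x).bit_length() - 1  # x & -x isolates the lowest set bit

# examples: tz(8) = 3 (0b1000), tz(12) = 2 (0b1100), tz(7) = 0 (0b111)
```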
\algo{DistinctElements} \algalias{AMS}
\algin the data stream $\alpha$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent
family.
\::$z \= 0$.
the space used by a single estimator is $\O(\log n)$ since we can store $h$ in
$\O(\log n)$ bits, and $z$ in $\O(\log \log n)$ bits, and hence a
$(3, \delta)$-estimator uses $\O(\log(1/\delta) \cdot \log n)$ bits.
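For concreteness, a single AMS estimator can be sketched in Python as below.
This excerpt omits the Process step, so the sketch assumes the standard
presentation: $z$ tracks the maximum ${\tt tz}(h(x))$ seen, the output is
$2^{z + 1/2}$, and the 2-independent family is $h(x) = (ax + b) \bmod p$ for a
prime $p$ (all of these are assumptions, not taken from the excerpt):

```python
import random

def ams_estimate(stream):
    """One AMS distinct-elements estimator: z tracks the maximum number of
    trailing zeroes among hash values; the estimate is 2^(z + 1/2)."""
    p = (1 << 61) - 1                        # prime modulus, assumed >= n
    a = random.randrange(1, p)
    b = random.randrange(p)
    z = 0
    for x in stream:
        h = (a * x + b) % p                  # 2-independent hash of x
        if h:                                # tz(0) is undefined; skip it
            z = max(z, (h & -h).bit_length() - 1)
    return 2 ** (z + 0.5)
```

Since the output is a power of two (times $\sqrt 2$), a single run can only be
a constant-factor estimate, which is why the accuracy parameter is fixed at $3$.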
\subsection{The BJKST Algorithm}
We will now look at another algorithm for the distinct elements problem.
Note that unlike the AMS algorithm, it accepts an accuracy parameter
$\varepsilon$.
\algo{DistinctElements} \algalias{BJKST}
\algin the data stream $\alpha$, the accuracy $\varepsilon$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent
family.
\::$z \= 0$, $B \= \emptyset$.
\:\em{Process}($x$):
\::If ${\tt tz}(h(x)) \geq z$:
\:::$B \= B \cup \{ (x, {\tt tz}(h(x))) \}$
\:::While $\vert B \vert \geq c/\varepsilon^2$:
\::::$z \= z + 1$.
\::::Remove all $(a, b)$ from $B$ such that $b = {\tt tz}(h(a)) < z$.
\algout $\hat{d} \= \vert B \vert \cdot 2^{z}$.
\endalgo
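A direct transcription into Python. This is a sketch: the hash family
$h(x) = (ax + b) \bmod p$ and the choice $c = 576$ (the constant from the
analysis below) are assumptions, and $B$ is stored as a dictionary mapping
$x$ to ${\tt tz}(h(x))$:

```python
import random

def bjkst(stream, eps, c=576):
    """One run of the BJKST sketch; returns the estimate |B| * 2^z."""
    p = (1 << 61) - 1                          # prime modulus, assumed >= n
    a = random.randrange(1, p)
    b = random.randrange(p)
    tz = lambda v: (v & -v).bit_length() - 1   # trailing zeroes of v > 0
    z, B = 0, {}                               # B maps x -> tz(h(x))
    for x in stream:
        t = tz((a * x + b) % p or p)           # avoid tz(0) if h(x) = 0
        if t >= z:
            B[x] = t
            while len(B) >= c / eps**2:        # B too big: raise z, shrink B
                z += 1
                B = {y: ty for y, ty in B.items() if ty >= z}
    return len(B) * 2 ** z
```

On a stream with fewer than $c/\varepsilon^2$ distinct elements the threshold
is never reached, so $z$ stays $0$ and the output is exactly $d$.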
\lemma{For any $\varepsilon > 0$, the BJKST algorithm is an
$(\varepsilon, \delta)$-estimator for some constant $\delta$.}
\proof
We set up the random variables $X_{r, j}$ and $Y_r$ as before. Let $t$ denote
the value of $z$ when the algorithm terminates, then $Y_t = \vert B \vert$,
and our estimate $\hat{d} = \vert B \vert \cdot 2^t = Y_t \cdot 2^t$.
Note that if $t = 0$, the algorithm computes $d$ exactly (since we never remove
any elements from $B$, and $\hat{d} = \vert B \vert$). For $t \geq 1$, we
say that the algorithm \em{fails} iff
$\vert Y_t \cdot 2^t - d \vert \geq \varepsilon d$. Rearranging, we have that
the algorithm fails iff:
$$\left\vert Y_t - {d \over 2^t} \right\vert \geq {\varepsilon d \over 2^t}$$
To bound the probability of this event, we will sum over all possible values
$r \in [\log n]$ that $t$ can take. Note that for \em{small} values of $r$,
a failure is unlikely when $t = r$, since the required deviation $d/2^t$ is
large. For \em{large} values of $r$, simply achieving $t = r$ is difficult.
More formally, let $s$ be the unique integer such that:
$${12 \over \varepsilon^2} \leq {d \over 2^s} \leq {24 \over \varepsilon^2}$$
Then we have:
$$\Pr[{\rm fail}] = \sum_{r=1}^{\log n} \Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r} \land t = r \right]$$
After splitting the sum around $s$, we bound small and large values by different
methods as described above to get:
$$\Pr[{\rm fail}] \leq \sum_{r=1}^{s-1} \Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r} \right] + \sum_{r=s}^{\log n} \Pr\left[ t = r \right]$$
Recall that $\E[Y_r] = d/2^r$ and (as in the analysis of the AMS algorithm)
$\Var[Y_r] \leq \E[Y_r]$ by the 2-independence of $h$, so the terms in the
first sum can be bounded using Chebyshev's inequality. The second sum is equal
to the probability of the event $[t \geq s]$, that is, the event
$Y_{s-1} \geq c/\varepsilon^2$ (since $z$ is only increased when $B$ becomes
larger than this threshold). We will simply use Markov's inequality to bound
this event. Putting it all together, we have:
$$\eqalign{
\Pr[{\rm fail}] &\leq \sum_{r=1}^{s-1} {\Var[Y_r] \over (\varepsilon d / 2^r)^2} + {\E[Y_{s-1}] \over c/\varepsilon^2}
\leq \sum_{r=1}^{s-1} {d/2^r \over (\varepsilon d / 2^r)^2} + {d/2^{s-1} \over c/\varepsilon^2} \cr
&= \sum_{r=1}^{s-1} {2^r \over \varepsilon^2 d} + {\varepsilon^2 d \over c \cdot 2^{s-1}}
\leq {2^{s} \over \varepsilon^2 d} + {\varepsilon^2 d \over c \cdot 2^{s-1}}
}$$
Recalling the definition of $s$, we have $2^s/d \leq \varepsilon^2/12$, and
$d/2^{s-1} \leq 48/\varepsilon^2$, and hence:
$$\Pr[{\rm fail}] \leq {1 \over 12} + {48 \over c}$$
which is smaller than (say) $1/6$ for $c > 576$. Hence the algorithm is an
$(\varepsilon, 1/6)$-estimator.
\qed
As before, we can run $\O(\log(1/\delta))$ independent copies of the algorithm,
and take the median of their estimates to reduce the probability of failure
to $\delta$. The only thing remaining is to look at the space usage of the
algorithm.
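The median trick is generic, so it can be sketched independently of the
estimator. The copy count $\lceil 12 \ln(1/\delta) \rceil$ (rounded up to an
odd number) is an illustrative choice: any odd $\Theta(\log(1/\delta))$ works,
with the constant set by the Chernoff bound.

```python
import math
import statistics

def median_of_copies(estimator, delta):
    """Boost a constant-failure-probability randomized estimator to failure
    probability delta by returning the median of independent copies."""
    k = math.ceil(12 * math.log(1 / delta)) | 1   # odd number of copies
    return statistics.median(estimator() for _ in range(k))

# usage (with some randomized estimator, e.g. a single BJKST run):
#   median_of_copies(single_run, delta=0.01)      # single_run is hypothetical
```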
The counter $z$ requires only $\O(\log \log n)$ bits, and $B$ has
$\O(1/\varepsilon^2)$ entries, each of which needs $\O(\log n)$ bits.
Finally, the hash function $h$ needs $\O(\log n)$ bits, so the total space
used is dominated by $B$, and the algorithm uses $\O(\log n / \varepsilon^2)$
space.
\endchapter