From 16df9964a6b68522c885ae0327bb86470c1b0a59 Mon Sep 17 00:00:00 2001 From: Parth Mittal <parth15069@iiitd.ac.in> Date: Sat, 8 May 2021 14:06:21 +0530 Subject: [PATCH] wrote the BJKST algorithm, some edits --- streaming/streaming.tex | 110 ++++++++++++++++++++++++++++++++++++++-- 1 file changed, 107 insertions(+), 3 deletions(-) diff --git a/streaming/streaming.tex b/streaming/streaming.tex index e5f77fe..0abcf72 100644 --- a/streaming/streaming.tex +++ b/streaming/streaming.tex @@ -149,6 +149,14 @@ are all wrong is: $$ \Pr\left[\bigcap_i X_i > \varepsilon \cdot m \right] \leq 1/2^t \leq \delta $$ \qed +The main advantage of this algorithm is that the table $C$ for the +concatenation of two streams (computed with the same hash functions $h_i$) is +just the sum of the tables $C$ of the individual streams. It can also be +extended to support events which remove an occurrence of an element $x$ (with +the caveat that upon termination the ``frequency'' $f_x$ of each $x$ must be +non-negative). (TODO: perhaps make the second part an exercise?). + + \section{Counting Distinct Elements} We continue working with a stream $\alpha[1 \ldots m]$ of integers from $[n]$, and define $f_a$ (the frequency of $a$) as before. Let @@ -156,13 +164,16 @@ $d = \vert \{ j : f_j > 0 \} \vert$. Then the distinct elements problem is to estimate $d$. \subsection{The AMS Algorithm} +Suppose we map our universe $[n]$ to itself via a random permutation $\pi$. +Then if the number of distinct elements in a stream is $d$, we expect +$d / 2^i$ of them to be divisible by $2^i$ after applying $\pi$. This is the +core idea of the following algorithm. Define ${\tt tz}(x) := \max\{ i \mid 2^i $~divides~$ x \}$ - (i.e. the number of trailing zeroes in the base-2 representation of $x$). +(i.e. the number of trailing zeroes in the base-2 representation of $x$). \algo{DistinctElements} \algalias{AMS} -\algin the data stream $\alpha$, the accuracy $\varepsilon$, - the error parameter $\delta$. +\algin the data stream $\alpha$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent family. \:: $z \= 0$. @@ -227,4 +238,97 @@ the space used by a single estimator is $\O(\log n)$ since we can store $h$ in $\O(\log n)$ bits, and $z$ in $\O(\log \log n)$ bits, and hence a $(3, \delta)$ estimator uses $\O(\log (1/\delta) \cdot \log n)$ bits. +\subsection{The BJKST Algorithm} + +We will now look at another algorithm for the distinct elements problem. +Note that unlike the AMS algorithm, it accepts an accuracy parameter +$\varepsilon$. The constant $c$ it uses is fixed in the analysis below. + +\algo{DistinctElements} \algalias{BJKST} +\algin the data stream $\alpha$, the accuracy $\varepsilon$. +\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent + family. +\:: $z \= 0$, $B \= \emptyset$. +\:\em{Process}($x$): +\::If ${\tt tz}(h(x)) \geq z$: +\:::$B \= B \cup \{ (x, {\tt tz}(h(x))) \}$ +\:::While $\vert B \vert \geq c/\varepsilon^2$: +\::::$z \= z + 1$. +\::::Remove all $(a, b)$ from $B$ such that $b = {\tt tz}(h(a)) < z$. +\algout $\hat{d} \= \vert B \vert \cdot 2^{z}$. +\endalgo + +\lemma{ + For any $\varepsilon > 0$, the BJKST algorithm is an + $(\varepsilon, \delta)$-estimator for some constant $\delta$. +} + +\proof +We set up the random variables $X_{r, j}$ and $Y_r$ as before. Let $t$ denote +the value of $z$ when the algorithm terminates; then $Y_t = \vert B \vert$, +and our estimate is $\hat{d} = \vert B \vert \cdot 2^t = Y_t \cdot 2^t$. + +Note that if $t = 0$, the algorithm computes $d$ exactly (since we never remove +any elements from $B$, and $\hat{d} = \vert B \vert$). For $t \geq 1$, we +say that the algorithm \em{fails} iff +$\vert Y_t \cdot 2^t - d \vert \geq \varepsilon d$. Dividing by $2^t$, the +algorithm fails iff: + +$$ \left\vert Y_t - {d \over 2^t} \right\vert \geq {\varepsilon d \over 2^t} $$ + +To bound the probability of this event, we will sum over all possible values +$r \in [\log n]$ that $t$ can take.
Note that for \em{small} values of $r$, +a failure is unlikely when $t = r$, since the required deviation $\varepsilon d / 2^r$ is +large. For \em{large} values of $r$, simply achieving $t = r$ is difficult. +More formally, let $s$ be the unique integer such that: + +$$ {12 \over \varepsilon^2} \leq {d \over 2^s} < {24 \over \varepsilon^2}$$ + +Then we have: +$$ \Pr[{\rm fail}] = \sum_{r = 1}^{\log n} + \Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r} + \land t = r \right] $$ +After splitting the sum around $s$, we bound small and large values by different +methods as described above to get: +$$ \Pr[{\rm fail}] \leq \sum_{r = 1}^{s - 1} + \Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r} \right] + +\sum_{r = s}^{\log n} + \Pr\left[t = r \right] $$ +Recall that $\E[Y_r] = d / 2^r$, so the terms in the first sum can be bounded +using Chebyshev's inequality. The second sum is equal to the probability of +the event $[t \geq s]$, that is, the event $Y_{s - 1} \geq c / \varepsilon^2$ +(since $z$ is only increased when $\vert B \vert$ reaches this threshold). +We will simply use Markov's inequality to bound this event. + +Putting it all together, we have: +$$\eqalign{ + \Pr[{\rm fail}] &\leq \sum_{r = 1}^{s - 1} + {\Var[Y_r] \over (\varepsilon d / 2^r)^2} + {\E[Y_{s - 1}] \over c / \varepsilon^2} + \leq \sum_{r = 1}^{s - 1} + {d / 2^r \over (\varepsilon d / 2^r)^2} + {d / 2^{s - 1} \over c / \varepsilon^2}\cr + &= \sum_{r = 1}^{s - 1} {2^r \over \varepsilon^2 d} + {\varepsilon^2 d \over c2^{s - 1}} + \leq {2^{s} \over \varepsilon^2 d} + {\varepsilon^2 d \over c2^{s - 1}} +} +$$ +Recalling the definition of $s$, we have $2^s / d \leq \varepsilon^2 / 12$, and +$d / 2^{s - 1} \leq 48 / \varepsilon^2$, and hence: +$$ \Pr[{\rm fail}] \leq {1 \over 12} + {48 \over c} $$ +which is smaller than (say) $1 / 6$ for $c > 576$. Hence the algorithm is an +$(\varepsilon, 1 / 6)$-estimator. 
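+
+As a quick sanity check on the constants (the value $c = 600$ here is only an
+illustrative assumption; any $c > 576$ works), substituting into the bound gives:
+$$ {1 \over 12} + {48 \over 600} = {25 \over 300} + {24 \over 300}
+   = {49 \over 300} < {1 \over 6} $$
+With, say, $\varepsilon = 1/10$, the threshold on $\vert B \vert$ would then be
+$c / \varepsilon^2 = 600 \cdot 100 = 60000$ entries.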
+ +\qed + +As before, we can run $\O(\log (1/\delta))$ independent copies of the algorithm, +and take the median of their estimates to reduce the probability of failure +to $\delta$. The only thing remaining is to look at the space usage of the +algorithm. + +The counter $z$ requires only $\O(\log \log n)$ bits, and $B$ has +$\O(1 / \varepsilon^2)$ entries, each of which needs $\O( \log n )$ bits. +Finally, the hash function $h$ needs $\O(\log n)$ bits, so the total space +used is dominated by $B$, and the algorithm uses $\O(\log n / \varepsilon^2)$ +bits of space. + \endchapter + + -- GitLab