Commit 16df9964 by Parth Mittal

### wrote the BJKST algorithm, some edits

parent d445af83
@@ -149,6 +149,14 @@
are all wrong is:
$$\Pr\left[\bigcap_i X_i > \varepsilon \cdot m \right] \leq 1/2^t \leq \delta$$
\qed

The main advantage of this algorithm is that it is mergeable: the table $C$ for
the concatenation of two streams (computed with the same set of hash functions $h_i$)
is just the sum of their respective tables.

It can also be extended to support events which remove an occurrence of an
element $x$ (with the caveat that upon termination the ``frequency'' $f_x$ for
each $x$ must be non-negative).

(TODO: perhaps make the second part an exercise?).

\section{Counting Distinct Elements}

We continue working with a stream $\alpha[1 \ldots m]$ of integers from $[n]$,
and define $f_a$ (the frequency of $a$) as before. Let
@@ -156,13 +164,16 @@
$d = \vert \{ j : f_j > 0 \} \vert$. Then the distinct elements problem is
to estimate $d$.

\subsection{The AMS Algorithm}

Suppose we map our universe $[n]$ to itself via a random permutation $\pi$.
Then if the number of distinct elements in a stream is $d$, we expect
$d / 2^i$ of them to be divisible by $2^i$ after applying $\pi$. This is the
core idea of the following algorithm.

Define ${\tt tz}(x) := \max\{ i \mid 2^i$~divides~$x \}$
(i.e. the number of trailing zeroes in the base-2 representation of $x$).

\algo{DistinctElements} \algalias{AMS}
\algin the data stream $\alpha$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent family.
\:: $z \= 0$.
@@ -227,4 +238,97 @@
the space used by a single estimator is $\O(\log n)$
since we can store $h$ in $\O(\log n)$ bits, and $z$ in $\O(\log \log n)$ bits,
and hence a $(3, \delta)$ estimator uses $\O(\log (1/\delta) \cdot \log n)$ bits.

\subsection{The BJKST Algorithm}

We will now look at another algorithm for the distinct elements problem.
Note that unlike the AMS algorithm, it accepts an accuracy parameter $\varepsilon$.

\algo{DistinctElements} \algalias{BJKST}
\algin the data stream $\alpha$, the accuracy $\varepsilon$.
\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent family.
\:: $z \= 0$, $B \= \emptyset$.
\:\em{Process}($x$):
\::If ${\tt tz}(h(x)) \geq z$:
\:::$B \= B \cup \{ (x, {\tt tz}(h(x))) \}$
\:::While $\vert B \vert \geq c/\varepsilon^2$:
\::::$z \= z + 1$.
\::::Remove all $(a, b)$ from $B$ such that $b = {\tt tz}(h(a)) < z$.
\algout $\hat{d} \= \vert B \vert \cdot 2^{z}$.
\endalgo

Here $c$ is a constant that will be fixed in the analysis.

\lemma{
For any $\varepsilon > 0$, the BJKST algorithm is an
$(\varepsilon, \delta)$-estimator for some constant $\delta$.
}

\proof
We set up the random variables $X_{r, j}$ and $Y_r$ as before. Let $t$ denote
the value of $z$ when the algorithm terminates; then $Y_t = \vert B \vert$,
and our estimate is $\hat{d} = \vert B \vert \cdot 2^t = Y_t \cdot 2^t$.

Note that if $t = 0$, the algorithm computes $d$ exactly (since we never remove
any elements from $B$, and $\hat{d} = \vert B \vert$). For $t \geq 1$, we
say that the algorithm \em{fails} iff
$\vert Y_t \cdot 2^t - d \vert \geq \varepsilon d$. Dividing by $2^t$, we have
that the algorithm fails iff:
$$\left\vert Y_t - {d \over 2^t} \right\vert \geq {\varepsilon d \over 2^t}$$
To bound the probability of this event, we will sum over all possible values
$r \in [\log n]$ that $t$ can take. Note that for \em{small} values of $r$,
a failure is unlikely when $t = r$, since the required deviation
$\varepsilon d / 2^r$ is large. For \em{large} values of $r$, simply achieving
$t = r$ is difficult.
More formally, let $s$ be the unique integer such that:
$${12 \over \varepsilon^2} \leq {d \over 2^s} < {24 \over \varepsilon^2}$$
Then we have:
$$\Pr[{\rm fail}] = \sum_{r = 1}^{\log n}
	\Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r}
	\land t = r \right]$$
After splitting the sum around $s$, we bound small and large values by different
methods as described above to get:
$$\Pr[{\rm fail}] \leq \sum_{r = 1}^{s - 1}
	\Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r} \right]
	+ \sum_{r = s}^{\log n} \Pr\left[t = r \right]$$
Recall that $\E[Y_r] = d / 2^r$ and $\Var[Y_r] \leq d / 2^r$, so the terms in
the first sum can be bounded using Chebyshev's inequality. The second sum is
equal to the probability of the event $[t \geq s]$, that is, the event
$Y_{s - 1} \geq c / \varepsilon^2$ (since $z$ is only increased when $\vert B \vert$
reaches this threshold). We will simply use Markov's inequality to bound this event.

Putting it all together, we have:
$$\eqalign{
\Pr[{\rm fail}] &\leq \sum_{r = 1}^{s - 1} {\Var[Y_r] \over (\varepsilon d / 2^r)^2}
	+ {\E[Y_{s - 1}] \over c / \varepsilon^2}
	\leq \sum_{r = 1}^{s - 1} {d / 2^r \over (\varepsilon d / 2^r)^2}
	+ {d / 2^{s - 1} \over c / \varepsilon^2}\cr
&= \sum_{r = 1}^{s - 1} {2^r \over \varepsilon^2 d}
	+ {\varepsilon^2 d \over c2^{s - 1}}
	\leq {2^{s} \over \varepsilon^2 d} + {\varepsilon^2 d \over c2^{s - 1}}
}$$
Recalling the definition of $s$, we have $2^s / d \leq \varepsilon^2 / 12$, and
$d / 2^{s - 1} \leq 48 / \varepsilon^2$, and hence:
$$\Pr[{\rm fail}] \leq {1 \over 12} + {48 \over c}$$
which is smaller than (say) $1 / 6$ for $c > 576$. Hence the algorithm is an
$(\varepsilon, 1 / 6)$-estimator.
\qed

As before, we can run $\O(\log (1/\delta))$ independent copies of the algorithm,
and take the median of their estimates to reduce the probability of failure
to $\delta$. The only thing remaining is to look at the space usage of the algorithm.
The counter $z$ requires only $\O(\log \log n)$ bits, and $B$ has
$\O(1 / \varepsilon^2)$ entries, each of which needs $\O( \log n )$ bits.
Finally, the hash function $h$ needs $\O(\log n)$ bits, so the total space used
is dominated by $B$, and the algorithm uses $\O(\log n / \varepsilon^2)$ bits.

\endchapter
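As a rough illustration of the BJKST pseudocode above, here is a minimal Python sketch. Several details are assumptions made only for this example and do not come from the notes: the class name `BJKST`, the Mersenne prime `P`, the hash `((a*x + b) mod P) mod n` (which is only approximately 2-independent once reduced mod `n`), the cap used for `tz(0)` (the maximum in the definition of ${\tt tz}$ is unbounded at $0$), and the default `c = 577` (any $c > 576$ fits the analysis).

```python
import random

# Assumption: a Mersenne prime comfortably larger than the universe size n.
P = (1 << 61) - 1


def tz(x, cap):
    """Number of trailing zero bits of x; tz(0) is capped at `cap` (assumption)."""
    if x == 0:
        return cap
    return (x & -x).bit_length() - 1


class BJKST:
    """Hypothetical sketch of the BJKST distinct-elements estimator."""

    def __init__(self, n, eps, c=577, seed=None):
        rng = random.Random(seed)
        self.n = n
        self.cap = n.bit_length()        # strictly larger than tz of any nonzero value < n
        self.a = rng.randrange(1, P)     # random coefficients of the hash function
        self.b = rng.randrange(P)
        self.threshold = c / eps ** 2    # the bound c / eps^2 on |B|
        self.z = 0
        self.B = {}                      # maps x -> tz(h(x))

    def h(self, x):
        # Assumption: approximately 2-independent hash from [n] to [n].
        return ((self.a * x + self.b) % P) % self.n

    def process(self, x):
        t = tz(self.h(x), self.cap)
        if t >= self.z:
            self.B[x] = t
            while len(self.B) >= self.threshold:
                self.z += 1
                # Drop every pair (a, b) with b < z.
                self.B = {a: b for a, b in self.B.items() if b >= self.z}

    def estimate(self):
        return len(self.B) * 2 ** self.z
```

On a stream with fewer than $c / \varepsilon^2$ distinct elements, `z` stays at 0 and `estimate()` returns the exact count, which matches the $t = 0$ case of the proof. To lower the failure probability to $\delta$, one would run $\O(\log (1/\delta))$ independently seeded copies and take the median of their estimates.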