diff --git a/streaming/Makefile b/streaming/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..ba6c63ec5ded7d730418a9f14feceb7ebe02fa5e
--- /dev/null
+++ b/streaming/Makefile
@@ -0,0 +1,3 @@
+TOP=..
+
+include ../Makerules
diff --git a/streaming/streaming.tex b/streaming/streaming.tex
new file mode 100644
index 0000000000000000000000000000000000000000..9785c3c119697f3c2abbc75e1f823aea0bd1983d
--- /dev/null
+++ b/streaming/streaming.tex
@@ -0,0 +1,339 @@
+\ifx\chapter\undefined
+\input adsmac.tex
+\singlechapter{20}
+\fi
+
+\chapter[streaming]{Streaming Algorithms}
+
+For this chapter, we will consider the streaming model. In this
+setting, the input is presented as a ``stream'' which we can read
+\em{in order}. In particular, at each step, we can do some processing,
+and then move forward one unit in the stream to read the next piece of data.
+We can choose to read the input again after completing a ``pass'' over it.
+
+There are two measures for the performance of algorithms in this setting.
+The first is the number of passes we make over the input, and the second is
+the amount of memory that we consume. Some interesting special cases are:
+\tightlist{o}
+\: 1 pass, and $O(1)$ memory: This is equivalent to computing with a DFA, and
+hence we can recognise only regular languages.
+\: 1 pass, and unbounded memory: We can store the entire stream, and hence this
+is just the traditional computing model.
+\endlist
+
+\section{Frequent Elements}
+
+For this problem, the input is a stream $\alpha[1 \ldots m]$ where each
+$\alpha[i] \in [n]$.
+We define for each $j \in [n]$ the \em{frequency} $f_j$ which counts
+the occurrences of $j$ in $\alpha[1 \ldots m]$. Then the majority problem
+is to find (if it exists) a $j$ such that $f_j > m / 2$.
+
+We consider the more general frequent elements problem, where we want to find
+$F_k = \{ j \mid f_j > m / k \}$. Suppose that we knew some small set
+$C$ which contains $F_k$. Then, with a pass over the input, we can count the
+occurrences of each element of $C$, and hence find $F_k$ in
+$\O(\vert C \vert \log m)$ space.
+
+\subsection{The Misra/Gries Algorithm}
+
+We will now see a deterministic one-pass algorithm that estimates the frequency
+of each element in a stream of integers. We shall see that it also provides
+us with a small set $C$ containing $F_k$, and hence lets us solve the frequent
+elements problem efficiently.
+
+\algo{FrequencyEstimate} \algalias{Misra/Gries Algorithm}
+\algin the data stream $\alpha$, the parameter $k$ of the estimator.
+\:\em{Init}: $A \= \emptyset$. \cmt{an empty map}
+\:\em{Process}($x$):
+\::If $x \in$ keys($A$), $A[x] \= A[x] + 1$.
+\::Else If $\vert$keys($A$)$\vert < k - 1$, $A[x] \= 1$.
+\::Else
+  \forall $a \in $~keys($A$): $A[a] \= A[a] - 1$,
+  delete $a$ from $A$ if $A[a] = 0$.
+\algout $\hat{f}_a = A[a]$ if $a \in $~keys($A$), and $\hat{f}_a = 0$ otherwise.
+\endalgo
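To make the bookkeeping concrete, here is a small Python sketch of one pass of
the procedure above (an illustration added alongside the notes, not part of
them; the function name and the use of a plain dictionary for $A$ are
arbitrary choices):

    def misra_gries(stream, k):
        # One pass of the frequency estimator: keeps at most k - 1 counters.
        A = {}
        for x in stream:
            if x in A:
                A[x] += 1
            elif len(A) < k - 1:
                A[x] = 1
            else:
                # decrement every stored counter, dropping those that reach zero
                for a in list(A):
                    A[a] -= 1
                    if A[a] == 0:
                        del A[a]
        return A  # estimate: f_hat[a] = A.get(a, 0)

For example, misra_gries([1, 2, 1, 3, 1, 4], 3) returns {1: 2, 4: 1}; here
$m = 6$ and $k = 3$, so the lemma below guarantees $f_a - 2 \leq \hat{f}_a \leq f_a$,
and indeed the element $1$ with $f_1 = 3$ is estimated as $2$.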
+Let us show that $\hat{f}_a$ is a good estimate for the frequency $f_a$.
+
+\lemma{
+$f_a - m / k \leq \hat{f}_a \leq f_a$
+}
+
+\proof
+We see immediately that $\hat{f}_a \leq f_a$, since $A[a]$ is only incremented
+when we see $a$ in the stream.
+
+To see the other inequality, suppose that we have a counter for each
+$a \in [n]$ (instead of just $k - 1$ keys at a time). Whenever we have at least
+$k$ non-zero counters, we will decrease all of them by $1$; this gives exactly
+the same estimate as the algorithm above.
+
+Now consider the potential function $\Phi = \sum_{a \in [n]} A[a]$. Over the
+whole run, $\Phi$ increases by exactly $m$ (by $1$ for each of the $m$ elements
+of $\alpha$), and every decrementing step decreases it by at least $k$ (since
+at least $k$ counters are non-zero at that moment). Since $\Phi = 0$ initially
+and $\Phi \geq 0$ at the end, there are at most $m / k$ decrementing steps, and
+hence each $A[x]$ decreases at most $m / k$ times.
+\qed
+
+\theorem{
+  There exists a deterministic 2-pass algorithm that finds $F_k$ in
+  $\O(k(\log n + \log m))$ space.
+}
+\proof
+In the first pass, we obtain the frequency estimate $\hat{f}$ by the
+Misra/Gries algorithm.
+We set $C = \{ a \mid \hat{f}_a > 0 \}$. For $a \in F_k$, we have
+$f_a > m / k$, and hence $\hat{f}_a > 0$ by the previous Lemma.
+In the second pass, we count $f_c$ exactly for each $c \in C$, and hence know
+$F_k$ at the end.
+
+To see the bound on space used, note that
+$\vert C \vert = \vert$keys($A$)$\vert \leq k - 1$, and a key-value pair can
+be stored in $\O(\log n + \log m)$ bits.
+\qed
+
+\subsection{The Count-Min Sketch}
+
+We will now look at a randomized streaming algorithm that solves the
+frequency estimation problem. While this algorithm can fail with some
+probability, it has the advantage that the output on two different streams
+can be easily combined.
+
+\algo{FrequencyEstimate} \algalias{Count-Min Sketch}
+\algin the data stream $\alpha$, the accuracy $\varepsilon$,
+  the error parameter $\delta$.
+\:\em{Init}:
+  $C[1\ldots t][1\ldots k] \= 0$, where $k \= \lceil 2 / \varepsilon \rceil$
+  and $t \= \lceil \log(1 / \delta) \rceil$.
+\:: Choose $t$ independent hash functions $h_1, \ldots , h_t : [n] \to [k]$, each
+  from a 2-independent family.
+\:\em{Process}($x$):
+\::For $i \in [t]$: $C[i][h_i(x)] \= C[i][h_i(x)] + 1$.
+\algout Report $\hat{f}_a = \min_{i \in [t]} C[i][h_i(a)]$.
+\endalgo
+
+Note that the algorithm needs $\O(tk \log m)$ bits to store the table $C$, and
+$\O(t \log n)$ bits to store the hash functions $h_1, \ldots , h_t$, and hence
+uses $\O(1/\varepsilon \cdot \log (1 / \delta) \cdot \log m +
+ \log (1 / \delta)\cdot \log n)$ bits. It remains to show that it computes
+a good estimate.
+
+\lemma{
+  $f_a \leq \hat{f}_a \leq f_a + \varepsilon m$ with probability at least $1 - \delta$.
+}
+
+\proof
+Clearly $\hat{f}_a \geq f_a$ for all $a \in [n]$; we will show that
+$\hat{f}_a \leq f_a + \varepsilon m$ fails with probability at most $\delta$.
+For a fixed element $a$, define the random variable
+$$X_i := C[i][h_i(a)] - f_a$$
+For $j \in [n] \setminus \{ a \}$, define the
+indicator variable $Y_{i, j} := [ h_i(j) = h_i(a) ]$. Then we can see that
+$$X_i = \sum_{j \neq a} f_j\cdot Y_{i, j}$$
+
+Note that $\E[Y_{i, j}] = 1/k$ since each $h_i$ is from a 2-independent family,
+and hence by linearity of expectation:
+$$\E[X_i] = {\vert\vert f \vert\vert_1 - f_a \over k} =
+  {\vert\vert f_{-a} \vert\vert_1 \over k}$$
+where $f_{-a}$ denotes the frequency vector $f$ with its $a$-th coordinate set to $0$.
+
+By applying Markov's inequality we obtain a bound on the error of a single
+counter (using $m \geq \vert\vert f_{-a} \vert\vert_1$ and $k \geq 2 / \varepsilon$):
+$$ \Pr[X_i > \varepsilon \cdot m ] \leq
+   \Pr[ X_i > \varepsilon \cdot \vert\vert f_{-a} \vert\vert_1 ] \leq
+   {1 \over k\varepsilon} \leq 1/2$$
+
+Finally, since the $t$ hash functions are independent, the probability that
+all $t$ counters are wrong is:
+
+$$ \Pr\left[\bigcap_i \left\{ X_i > \varepsilon \cdot m \right\} \right] \leq 1/2^t \leq \delta $$
+\qed
+
+The main advantage of this algorithm is that its output on two different
+streams (computed with the same set of hash functions $h_i$) is just the sum
+of the respective tables $C$. It can also be extended to support events
+which remove an occurrence of an element $x$ (with the caveat that upon
+termination the ``frequency'' $f_x$ for each $x$ must be non-negative).
+
+(TODO: perhaps make the second part an exercise?).
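To illustrate the combination of sketches (and the removal events just
mentioned), here is a small Python sketch (again an illustration added
alongside the notes, not part of them; the class name, the particular Mersenne
prime, and the convention that mergeable sketches share a seed are
implementation choices):

    import random

    class CountMinSketch:
        P = (1 << 61) - 1  # prime modulus for the 2-independent hash family

        def __init__(self, k, t, seed=0):
            rng = random.Random(seed)
            self.k, self.t = k, t
            self.C = [[0] * k for _ in range(t)]
            # one hash function h_i(x) = ((a*x + b) mod P) mod k per row
            self.ab = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(t)]

        def _h(self, i, x):
            a, b = self.ab[i]
            return ((a * x + b) % self.P) % self.k

        def process(self, x, count=1):
            # count = -1 removes one occurrence (strict turnstile model)
            for i in range(self.t):
                self.C[i][self._h(i, x)] += count

        def estimate(self, a):
            return min(self.C[i][self._h(i, a)] for i in range(self.t))

        def merge(self, other):
            # valid only if both sketches were built with the same seed
            for i in range(self.t):
                for j in range(self.k):
                    self.C[i][j] += other.C[i][j]

Two sketches built with the same seed and fed two different streams can be
combined with merge; the resulting table $C$ is the one a single sketch would
have produced on the concatenation of the two streams.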
+
+\section{Counting Distinct Elements}
+We continue working with a stream $\alpha[1 \ldots m]$ of integers from $[n]$,
+and define $f_a$ (the frequency of $a$) as before. Let
+$d = \vert \{ j : f_j > 0 \} \vert$. Then the distinct elements problem is
+to estimate $d$.
+
+\subsection{The AMS Algorithm}
+Suppose we map our universe $[n]$ to itself via a random permutation $\pi$.
+Then if the number of distinct elements in a stream is $d$, we expect
+$d / 2^i$ of them to be divisible by $2^i$ after applying $\pi$. This is the
+core idea of the following algorithm.
+
+Define ${\tt tz}(x) := \max\{ i \mid 2^i $~divides~$ x \}$
+(i.e. the number of trailing zeroes in the base-2 representation of $x$).
+
+\algo{DistinctElements} \algalias{AMS}
+\algin the data stream $\alpha$.
+\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent
+  family.
+\:: $z \= 0$.
+\:\em{Process}($x$):
+\::If ${\tt tz}(h(x)) > z$: $z \= {\tt tz}(h(x))$.
+\algout $\hat{d} \= 2^{z + 1/2}$
+\endalgo
+
+\lemma{
+  The AMS algorithm is a $(3, \delta)$-estimator for some constant
+  $\delta < 1$.
+}
+\proof
+For $j \in [n]$, $r \geq 0$, let $X_{r, j} := [ {\tt tz}(h(j)) \geq r ]$, the
+indicator that is true if $h(j)$ has at least $r$ trailing $0$s.
+Now define $$ Y_r = \sum_{j : f_j > 0} X_{r, j} $$
+How is our estimate related to $Y_r$? If the algorithm outputs
+$\hat{d} \geq 2^{a + 1/2}$, then we know that $Y_a > 0$. Similarly, if the
+output is smaller than $2^{a + 1/2}$, then we know that $Y_a = 0$. We will now
+bound the probabilities of these events.
+
+For any $j \in [n]$, $h(j)$ is uniformly distributed over $[n]$ (since $h$
+is $2$-independent), so $\E[X_{r, j}] = 1 / 2^r$ (assume for simplicity that
+$n$ is a power of $2$). By linearity of expectation, $\E[Y_{r}] = d / 2^r$.
+
+We will also use the variance of these variables -- note that
+$$\Var[X_{r, j}] \leq \E[X_{r, j}^2] = \E[X_{r, j}] = 1/2^r$$
+
+And because $h$ is $2$-independent, the variables $X_{r, j}$ and $X_{r, j'}$
+are independent for $j \neq j'$, and hence:
+$$\Var[Y_{r}] = \sum_{j : f_j > 0} \Var[X_{r, j}] \leq d / 2^r $$
+
+Now, let $a$ be the smallest integer such that $2^{a + 1/2} \geq 3d$. Then we
+have:
+$$ \Pr[\hat{d} \geq 3d] = \Pr[Y_a > 0] = \Pr[Y_a \geq 1] $$
+
+Using Markov's inequality we get:
+$$ \Pr[\hat{d} \geq 3d] \leq \E[Y_a] = {d \over 2^a} \leq {\sqrt{2} \over 3} $$
+
+For the other side, let $b$ be the largest integer such that
+$2^{b + 1/2} \leq d/3$. Then we have:
+$$ \Pr[\hat{d} \leq d / 3] = \Pr[ Y_{b + 1} = 0] \leq
+   \Pr[ \vert Y_{b + 1} - \E[Y_{b + 1}] \vert \geq d / 2^{b + 1} ]$$
+
+Using Chebyshev's inequality, we get:
+$$ \Pr[\hat{d} \leq d / 3] \leq {\Var[Y_{b + 1}] \over (d / 2^{b + 1})^2} \leq
+   {2^{b + 1} \over d} \leq {\sqrt{2} \over 3}$$
+
+\qed
+
+The previous algorithm is not particularly satisfying -- by our analysis it
+can make an error around $94\%$ of the time (taking the union of the two bad
+events). However, we can improve the success probability easily; we run $t$
+independent estimators simultaneously, and output the median of their results.
+By a standard use of Chernoff bounds one can show that the probability that
+the median is more than $3d$ is at most $2^{-\Theta(t)}$ (and similarly also
+the probability that it is less than $d / 3$).
+
+Hence it is enough to run $\O(\log (1/ \delta))$ copies of the AMS estimator
+to get a $(3, \delta)$ estimator for any $\delta > 0$.
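As an illustration (again an addition, not part of the notes), here is a small
Python sketch of a single AMS estimator together with the median-of-$t$
combination just described; the particular hash family, the prime modulus, and
the handling of a zero hash value are implementation choices:

    import random
    import statistics

    P = (1 << 61) - 1  # prime modulus for the 2-independent family

    class AMSEstimator:
        def __init__(self, rng):
            # h(x) = (a*x + b) mod P is drawn from a 2-independent family
            self.a = rng.randrange(1, P)
            self.b = rng.randrange(P)
            self.z = 0

        def process(self, x):
            h = (self.a * x + self.b) % P
            tz = 61 if h == 0 else (h & -h).bit_length() - 1  # trailing zeros
            self.z = max(self.z, tz)

        def estimate(self):
            return 2 ** (self.z + 0.5)

    def distinct_estimate(stream, t=35, seed=0):
        # median of t independent copies, as described above
        rng = random.Random(seed)
        copies = [AMSEstimator(rng) for _ in range(t)]
        for x in stream:
            for c in copies:
                c.process(x)
        return statistics.median(c.estimate() for c in copies)

With high probability, distinct_estimate returns a value within a factor of $3$
of the true number of distinct elements in the stream.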
+Finally, we note that the space used by a single estimator is $\O(\log n)$:
+we can store $h$ in $\O(\log n)$ bits, and $z$ in $\O(\log \log n)$ bits.
+Hence a $(3, \delta)$ estimator uses $\O(\log (1/\delta) \cdot \log n)$ bits.
+
+\subsection{The BJKST Algorithm}
+
+We will now look at another algorithm for the distinct elements problem.
+Note that unlike the AMS algorithm, it accepts an accuracy parameter
+$\varepsilon$.
+
+\algo{DistinctElements} \algalias{BJKST}
+\algin the data stream $\alpha$, the accuracy $\varepsilon$.
+\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent
+  family.
+\:: $z \= 0$, $B \= \emptyset$.
+\:\em{Process}($x$):
+\::If ${\tt tz}(h(x)) \geq z$:
+\:::$B \= B \cup \{ (x, {\tt tz}(h(x))) \}$
+\:::While $\vert B \vert \geq c/\varepsilon^2$:
+\::::$z \= z + 1$.
+\::::Remove all $(a, b)$ from $B$ such that $b = {\tt tz}(h(a)) < z$.
+\algout $\hat{d} \= \vert B \vert \cdot 2^{z}$.
+\endalgo
+
+\lemma{
+  For any $\varepsilon > 0$, the BJKST algorithm is an
+  $(\varepsilon, \delta)$-estimator for some constant $\delta$.
+}
+
+\proof
+We set up the random variables $X_{r, j}$ and $Y_r$ as before. Let $t$ denote
+the value of $z$ when the algorithm terminates; then $Y_t = \vert B \vert$,
+and our estimate is $\hat{d} = \vert B \vert \cdot 2^t = Y_t \cdot 2^t$.
+
+Note that if $t = 0$, the algorithm computes $d$ exactly (since we never remove
+any elements from $B$, and $\hat{d} = \vert B \vert$). For $t \geq 1$, we
+say that the algorithm \em{fails} iff
+$\vert Y_t \cdot 2^t - d \vert \geq \varepsilon d$. Rearranging, we have that the
+algorithm fails iff:
+
+$$ \left\vert Y_t - {d \over 2^t} \right\vert \geq {\varepsilon d \over 2^t} $$
+
+To bound the probability of this event, we will sum over all possible values
+$r \in [\log n]$ that $t$ can take. Note that for \em{small} values of $r$,
+a failure is unlikely when $t = r$, since the required deviation
+$\varepsilon d / 2^r$ is large (relative to the standard deviation of $Y_r$).
+For \em{large} values of $r$, simply achieving $t = r$ is difficult.
+More formally, let $s$ be the unique integer such that:
+
+$$ {12 \over \varepsilon^2} \leq {d \over 2^s} < {24 \over \varepsilon^2}$$
+
+Then we have:
+$$ \Pr[{\rm fail}] = \sum_{r = 1}^{\log n}
+   \Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r}
+   \land t = r \right] $$
+After splitting the sum around $s$, we bound small and large values by different
+methods as described above to get:
+$$ \Pr[{\rm fail}] \leq \sum_{r = 1}^{s - 1}
+   \Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r} \right]
+ + \sum_{r = s}^{\log n}
+   \Pr\left[t = r \right] $$
+Recall that $\E[Y_r] = d / 2^r$, so the terms in the first sum can be bounded
+using Chebyshev's inequality. The second sum is equal to the probability of
+the event $[t \geq s]$, that is, the event $Y_{s - 1} \geq c / \varepsilon^2$
+(since $z$ is only increased when $\vert B \vert$ reaches this threshold).
+We will use Markov's inequality to bound the probability of this event.
+
+Putting it all together, we have:
+$$\eqalign{
+    \Pr[{\rm fail}] &\leq \sum_{r = 1}^{s - 1}
+    {\Var[Y_r] \over (\varepsilon d / 2^r)^2} + {\E[Y_{s - 1}] \over c / \varepsilon^2}
+    \leq \sum_{r = 1}^{s - 1}
+    {d / 2^r \over (\varepsilon d / 2^r)^2} + {d / 2^{s - 1} \over c / \varepsilon^2}\cr
+    &= \sum_{r = 1}^{s - 1} {2^r \over \varepsilon^2 d} + {\varepsilon^2 d \over c2^{s - 1}}
+    \leq {2^{s} \over \varepsilon^2 d} + {\varepsilon^2 d \over c2^{s - 1}}
+}
+$$
+Recalling the definition of $s$, we have $2^s / d \leq \varepsilon^2 / 12$, and
+$d / 2^{s - 1} \leq 48 / \varepsilon^2$, and hence:
+$$ \Pr[{\rm fail}] \leq {1 \over 12} + {48 \over c} $$
+which is smaller than (say) $1 / 6$ for $c > 576$. Hence the algorithm is an
+$(\varepsilon, 1 / 6)$-estimator.
+
+\qed
+
+As before, we can run $\O(\log (1 / \delta))$ independent copies of the
+algorithm, and take the median of their estimates to reduce the probability of
+failure to $\delta$. The only thing remaining is to look at the space usage of
+the algorithm.
+
+The counter $z$ requires only $\O(\log \log n)$ bits, and $B$ has
+$\O(1 / \varepsilon^2)$ entries, each of which needs $\O( \log n )$ bits.
+Finally, the hash function $h$ needs $\O(\log n)$ bits, so the total space
+used is dominated by $B$, and the algorithm uses $\O(\log n / \varepsilon^2)$
+space. As before, if we use the median trick, the space used increases to
+$\O(\log (1 / \delta) \cdot \log n / \varepsilon^2)$.
+
+(TODO: include the version of this algorithm where we save space by storing
+$(g(a), {\tt tz}(h(a)))$ instead of $(a, {\tt tz}(h(a)))$ in $B$ for some
+hash function $g$ as an exercise?)
+
+\endchapter
+
+
diff --git a/tex/adsmac.tex b/tex/adsmac.tex
index c85048ab430f2c21d3c1668e110c493e58558710..66aa3fe96697a4d490a4d86d0332d4311ca51c0c 100644
--- a/tex/adsmac.tex
+++ b/tex/adsmac.tex
@@ -170,6 +170,7 @@
 \def\E{{\bb E}}
 \def\Pr{{\rm Pr}\mkern0.5mu}
 \def\Prsub#1{{\rm Pr}_{#1}}
+\def\Var{{\rm Var}\mkern0.5mu}
 % Vektory
 \def\t{{\bf t}}