Merge branch 'pm-streaming'

c23efdce · Martin Mareš · 983e1af4 · 9882fc7d · c23efdce · c23efdce
Commit c23efdce authored 3 years ago by Martin Mareš
--- a/streaming/Makefile
+++ b/streaming/Makefile
+TOP=..
+include ../Makerules
--- a/streaming/streaming.tex
+++ b/streaming/streaming.tex
+\ifx\chapter\undefined
+\input adsmac.tex
+\singlechapter{20}
+\fi
+\chapter[streaming]{Streaming Algorithms}
+For this chapter, we will consider the streaming model. In this
+setting, the input is presented as a ``stream'' which we can read
+\em{in order}. In particular, at each step, we can do some processing,
+and then move forward one unit in the stream to read the next piece of data.
+We can choose to read the input again after completing a ``pass'' over it.
+There are two measures for the performance of algorithms in this setting.
+The first is the number of passes we make over the input, and the second is
+the amount of memory that we consume. Some interesting special cases are:
+\tightlist{o}
+\: 1 pass, and $\O(1)$ memory: This is equivalent to computing with a DFA, and
+hence we can recognise only regular languages.
+\: 1 pass, and unbounded memory: We can store the entire stream, and hence this
+is just the traditional computing model.
+\endlist
+\section{Frequent Elements}
+For this problem, the input is a stream $\alpha[1 \ldots m]$ where each
+$\alpha[i] \in [n]$.
+We define for each $j \in [n]$ the \em{frequency} $f_j$ which counts
+the occurrences of $j$ in $\alpha[1 \ldots m]$. Then the majority problem
+is to find (if it exists) a $j$ such that $f_j > m / 2$.
+We consider the more general frequent elements problem, where we want to find
+$F_k = \{ j \mid f_j > m / k \}$. Suppose that we knew some small set
+$C$ which contains $F_k$. Then, with a pass over the input, we can count the
+occurrences of each element of $C$, and hence find $F_k$ in
+$\O(\vert C \vert \log m)$ space.
+\subsection{The Misra/Gries Algorithm}
+We will now see a deterministic one-pass algorithm that estimates the frequency
+of each element in a stream of integers. We shall see that it also provides
+us with a small set $C$ containing $F_k$, and hence lets us solve the frequent
+elements problem efficiently.
+\algo{FrequencyEstimate} \algalias{Misra/Gries Algorithm}
+\algin the data stream $\alpha$, the target for the estimator $k$.
+\:\em{Init}: $A \= \emptyset$. \cmt{an empty map}
+\:\em{Process}($x$): 
+\::If $x \in$ keys($A$), $A[x] \= A[x] + 1$.
+\::Else If $\vert$keys($A$)$\vert < k - 1$, $A[x] \= 1$.
+\::Else
+      For all $a \in $~keys($A$): $A[a] \= A[a] - 1$,
+      delete $a$ from $A$ if $A[a] = 0$.
+\algout $\hat{f}_a = A[a]$ if $a \in $~keys($A$), and $\hat{f}_a = 0$ otherwise.
+\endalgo
+Let us show that $\hat{f}_a$ is a good estimate for the frequency $f_a$.
+\lemma{
+$f_a - m / k \leq \hat{f}_a \leq f_a$
+}
+\proof
+We see immediately that $\hat{f}_a \leq f_a$, since it is only incremented when
+we see $a$ in the stream.
+To see the other inequality, suppose that we have a counter for each
+$a \in [n]$ (instead of just $k - 1$ keys at a time). Whenever we have at least
+$k$ non-zero counters, we will decrease all of them by $1$; this gives exactly
+the same estimate as the algorithm above. 
+Now consider the potential
+function $\Phi = \sum_{a \in [n]} A[a]$. Note that $\Phi$ increases by
+exactly $m$ in total (since $\alpha$ contains $m$ elements), and is decreased by $k$
+every time any $A[x]$ decreases. Since $\Phi = 0$ initially and $\Phi \geq 0$,
+we get that $A[x]$ decreases at most $m / k$ times, and hence
+$\hat{f}_a \geq f_a - m / k$.
+\qed
+\theorem{
+    There exists a deterministic 2-pass algorithm that finds $F_k$ in
+    $\O(k(\log n + \log m))$ space.
+}
+\proof
+In the first pass, we obtain the frequency estimate $\hat{f}$ by the
+Misra/Gries algorithm.
+We set $C = \{ a \mid \hat{f}_a > 0 \}$. For $a \in F_k$, we have
+$f_a > m / k$, and hence $\hat{f}_a > 0$ by the previous Lemma.
+In the second pass, we count $f_c$ exactly for each $c \in C$, and hence know
+$F_k$ at the end.
+To see the bound on space used, note that
+$\vert C \vert = \vert$keys($A$)$\vert \leq k - 1$, and a key-value pair can
+be stored in $\O(\log n + \log m)$ bits.
+\qed
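+The Misra/Gries procedure and the two-pass algorithm from the theorem above
+can be sketched in Python as follows. This is only an illustration under
+simplifying assumptions: the stream is a Python list (so that it can be read
+twice), and the function names {\tt misra\_gries} and
+{\tt frequent\_elements} are ours, not part of the original algorithm.

```python
from collections import Counter


def misra_gries(stream, k):
    """One pass of FrequencyEstimate, keeping at most k - 1 counters."""
    counters = {}  # the map A
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement every counter and drop those that reach zero.
            for a in list(counters):
                counters[a] -= 1
                if counters[a] == 0:
                    del counters[a]
    return counters


def frequent_elements(stream, k):
    """Two passes: candidates from Misra/Gries, then exact counting."""
    candidates = misra_gries(stream, k)                 # first pass
    exact = Counter(x for x in stream if x in candidates)  # second pass
    m = len(stream)
    return {a for a, f in exact.items() if f > m / k}
```

+For instance, on the stream $1, 2, 1, 3, 1, 4, 1, 5, 1, 2$ with $k = 3$, the
+first pass keeps counters for $1$ and $2$, and the second pass reports only
+the element $1$.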
+\subsection{The Count-Min Sketch}
+We will now look at a randomized streaming algorithm that solves the 
+frequency estimation problem. While this algorithm can fail with some
+probability, it has the advantage that the output on two different streams
+can be easily combined.
+\algo{FrequencyEstimate} \algalias{Count-Min Sketch}
+\algin the data stream $\alpha$, the accuracy $\varepsilon$,
+ the error parameter $\delta$.
+\:\em{Init}:
+ $C[1\ldots t][1\ldots k] \= 0$, where $k \= \lceil 2 / \varepsilon \rceil$
+ and $t \= \lceil \log(1 / \delta) \rceil$.
+\:: Choose $t$ independent hash functions $h_1, \ldots , h_t : [n] \to [k]$, each
+ from a 2-independent family.
+\:\em{Process}($x$): 
+\::For $i \in [t]$: $C[i][h_i(x)] \= C[i][h_i(x)] + 1$.
+\algout Report $\hat{f}_a = \min_{i \in [t]} C[i][h_i(a)]$.
+\endalgo
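+A minimal Python sketch of the procedure above might look as follows. The
+hash functions $h_i(x) = ((a_i x + b_i) \bmod p) \bmod k$ with the prime
+$p = 2^{61} - 1$ are an assumption of this sketch, a common stand-in for a
+2-independent family (after the final reduction modulo $k$ they are only
+approximately uniform). The class name, the {\tt seed} parameter and the
+{\tt merge} method are likewise ours, and we assume a binary logarithm in the
+definition of $t$.

```python
import math
import random


class CountMinSketch:
    """Illustrative Count-Min table C[t][k] with mergeable counters."""

    PRIME = (1 << 61) - 1  # assumed to be larger than the universe size n

    def __init__(self, eps, delta, seed=0):
        self.k = math.ceil(2 / eps)
        self.t = max(1, math.ceil(math.log2(1 / delta)))
        rng = random.Random(seed)
        # One pair (a, b) per row: h_i(x) = ((a*x + b) mod PRIME) mod k.
        self.coeffs = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(self.t)
        ]
        self.table = [[0] * self.k for _ in range(self.t)]

    def _h(self, i, x):
        a, b = self.coeffs[i]
        return ((a * x + b) % self.PRIME) % self.k

    def process(self, x):
        for i in range(self.t):
            self.table[i][self._h(i, x)] += 1

    def estimate(self, x):
        return min(self.table[i][self._h(i, x)] for i in range(self.t))

    def merge(self, other):
        """Add the table of a sketch that uses the same hash functions."""
        assert self.coeffs == other.coeffs, "sketches must share hash functions"
        for i in range(self.t):
            for j in range(self.k):
                self.table[i][j] += other.table[i][j]
```

+Two sketches created with the same {\tt seed} share their hash functions, so
+calling {\tt merge} yields the table of the concatenated stream.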
+Note that the algorithm needs $\O(tk \log m)$ bits to store the table $C$, and
+$\O(t \log n)$ bits to store the hash functions $h_1, \ldots , h_t$, and hence
+uses $\O(1/\varepsilon \cdot \log (1 / \delta) \cdot \log m
+ + \log (1 / \delta)\cdot  \log n)$ bits. It remains to show that it computes
+a good estimate.
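+As a concrete illustration (the values of $\varepsilon$ and $\delta$ below are
+chosen only for the example, and we assume that the logarithm is binary, which
+is what the bound $1/2^t \leq \delta$ in the analysis needs), take
+$\varepsilon = \delta = 0.01$:
+$$ k = \lceil 2 / 0.01 \rceil = 200, \qquad
+ t = \lceil \log_2 100 \rceil = 7, \qquad tk = 1400 $$
+So the table $C$ would hold $1400$ counters, and this number does not depend
+on $n$ or $m$.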
+\lemma{
+    $f_a \leq \hat{f}_a \leq f_a + \varepsilon m$ with probability at least
+    $1 - \delta$.
+}
+\proof
+Clearly $\hat{f}_a \geq f_a$ for all $a \in [n]$; we will show that
+$\hat{f}_a \leq f_a + \varepsilon m$ with probability at least $1 - \delta$.
+For a fixed element $a$, define the random variable
+$$X_i := C[i][h_i(a)] - f_a$$
+For $j \in [n] \setminus \{ a \}$, define the
+indicator variable $Y_{i, j} := [ h_i(j) = h_i(a) ]$. Then we can see that
+$$X_i = \sum_{j \neq a} f_j\cdot Y_{i, j}$$
+Note that $\E[Y_{i, j}] = 1/k$ since each $h_i$ is from a 2-independent family,
+and hence by linearity of expectation:
+$$\E[X_i] = {\vert\vert f \vert\vert_1 - f_a \over k} =
+ {\vert\vert f_{-a} \vert\vert_1 \over k}$$
+And by applying Markov's inequality we obtain a bound on the error of a single
+counter:
+$$ \Pr[X_i > \varepsilon \cdot m ] \leq
+ \Pr[ X_i > \varepsilon \cdot \vert\vert f_{-a} \vert\vert_1 ] \leq
+ {1 \over k\varepsilon} \leq 1/2$$
+Finally, since we have $t$ independent counters, the probability that they
+are all wrong is:
+$$ \Pr\left[\bigcap_i X_i > \varepsilon \cdot m \right] \leq 1/2^t \leq \delta $$
+\qed
+The main advantage of this algorithm is that the table $C$ for the
+concatenation of two streams (computed with the same set of hash functions
+$h_i$) is just the sum of the respective tables $C$. The algorithm can also
+be extended to support events
+which remove an occurrence of an element $x$ (with the caveat that upon
+termination the ``frequency'' $f_x$ for each $x$ must be non-negative).
+(TODO: perhaps make the second part an exercise?).
+\section{Counting Distinct Elements}
+We continue working with a stream $\alpha[1 \ldots m]$ of integers from $[n]$,
+and define $f_a$ (the frequency of $a$) as before. Let
+$d = \vert \{ j : f_j > 0 \} \vert$. Then the distinct elements problem is
+to estimate $d$.
+\subsection{The AMS Algorithm}
+Suppose we map our universe $[n]$ to itself via a random permutation $\pi$.
+Then if the number of distinct elements in a stream is $d$, we expect
+$d / 2^i$ of them to be divisible by $2^i$ after applying $\pi$. This is the
+core idea of the following algorithm.
+Define ${\tt tz}(x) := \max\{ i \mid 2^i $~divides~$ x \}$
+(i.e. the number of trailing zeroes in the base-2 representation of $x$).
+\algo{DistinctElements} \algalias{AMS}
+\algin the data stream $\alpha$.
+\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent
+ family.
+\:: $z \= 0$.
+\:\em{Process}($x$): 
+\::If ${\tt tz}(h(x)) > z$: $z \= {\tt tz}(h(x))$.
+\algout $\hat{d} \= 2^{z + 1/2}$
+\endalgo
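+The estimator above can be sketched in Python as follows. The hash function
+$h(x) = (ax + b) \bmod p$ with the prime $p = 2^{61} - 1$ is an assumption of
+this sketch (a stand-in for a 2-independent family on $[n]$), as are the
+function names, the {\tt seed} parameter, and the convention
+${\tt tz}(0) = 0$.

```python
import random

PRIME = (1 << 61) - 1  # assumed to be larger than the universe size n


def tz(x):
    """Number of trailing zero bits of x (tz(0) is taken to be 0 here)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1


def ams_estimate(stream, seed=0):
    """Single AMS estimator: track the largest tz(h(x)) seen so far."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, PRIME), rng.randrange(PRIME)
    z = 0
    for x in stream:
        z = max(z, tz((a * x + b) % PRIME))
    return 2 ** (z + 0.5)
```

+Since only the maximum of ${\tt tz}(h(x))$ is stored, repeated occurrences of
+an element never change the output.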
+\lemma{
+    The AMS algorithm is a $(3, \delta)$-estimator for some constant
+    $\delta$.
+}
+\proof
+For $j \in [n]$, $r \geq 0$, let $X_{r, j} := [ {\tt tz}(h(j)) \geq r ]$, the
+indicator that is true if $h(j)$ has at least $r$ trailing $0$s.
+Now define $$ Y_r = \sum_{j : f_j > 0} X_{r, j} $$
+How is our estimate related to $Y_r$? If the algorithm outputs
+$\hat{d} \geq 2^{a + 1/2}$, then we know that $Y_a > 0$. Similarly, if the
+output is smaller than $2^{a + 1/2}$, then we know that $Y_a = 0$. We will now
+bound the probabilities of these events.
+For any $j \in [n]$, $h(j)$ is uniformly distributed over $[n]$ (since $h$
+is $2$-independent). Hence $\E[X_{r, j}] = 1 / 2^r$. By linearity of
+expectation, $\E[Y_{r}] = d / 2^r$.
+We will also use the variance of these variables -- note that
+$$\Var[X_{r, j}] \leq \E[X_{r, j}^2] = \E[X_{r, j}] = 1/2^r$$
+And because $h$ is $2$-independent, the variables $X_{r, j}$ and $X_{r, j'}$
+are independent for $j \neq j'$, and hence:
+$$\Var[Y_{r}] = \sum_{j : f_j > 0} \Var[X_{r, j}] \leq d / 2^r $$
+Now, let $a$ be the smallest integer such that $2^{a + 1/2} \geq 3d$. Then we
+have:
+$$ \Pr[\hat{d} \geq 3d] = \Pr[Y_a > 0] = \Pr[Y_a \geq 1] $$
+Using Markov's inequality we get:
+$$ \Pr[\hat{d} \geq 3d] \leq \E[Y_a] = {d \over 2^a} \leq {\sqrt{2} \over 3} $$
+For the other side, let $b$ be the largest integer so that
+$2^{b + 1/2} \leq d/3$. Then we have:
+$$ \Pr[\hat{d} \leq d / 3] = \Pr[ Y_{b + 1} = 0] \leq
+ \Pr[ \vert Y_{b + 1} - \E[Y_{b + 1}] \vert \geq d / 2^{b + 1} ]$$
+Using Chebyshev's inequality, we get:
+$$ \Pr[\hat{d} \leq d / 3] \leq {\Var[Y_{b + 1}] \over (d / 2^{b + 1})^2} \leq
+ {2^{b + 1} \over d} \leq {\sqrt{2} \over 3}$$
+\qed
+The previous algorithm is not particularly satisfying -- by our analysis it
+can make an error around $94\%$ of the time (taking the union of the two bad
+events). However we can improve the success probability easily; we run $t$
+independent estimators simultaneously, and print the median of their outputs.
+By a standard use of Chernoff Bounds one can show that the probability that
+the median is more than $3d$ is at most $2^{-\Theta(t)}$ (and similarly also
+the probability that it is less than $d / 3$).
+Hence it is enough to run $\O(\log (1/ \delta))$ copies of the AMS estimator
+to get a $(3, \delta)$ estimator for any $\delta > 0$. Finally, we note that
+the space used by a single estimator is $\O(\log n)$ since we can store $h$ in
+$\O(\log n)$ bits, and $z$ in $\O(\log \log n)$ bits, and hence a $(3, \delta)$
+estimator uses $\O(\log (1/\delta) \cdot \log n)$ bits.
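+The median trick described above can be sketched in Python as follows. The
+helper name {\tt median\_of\_estimates} and the convention that an estimator
+is a function taking the stream and a {\tt seed} are assumptions of this
+sketch. Distinct seeds stand in for independent random choices, and the
+stream is stored in a list only so that every copy can be fed the same data.

```python
import statistics


def median_of_estimates(estimator, stream, copies):
    """Return the median of `copies` runs of `estimator` with distinct seeds."""
    items = list(stream)  # stored only so that each copy sees the same data
    return statistics.median(estimator(items, seed=i) for i in range(copies))
```

+In a true one-pass setting the copies would instead be updated side by side
+on each incoming element. An odd number of copies makes the median one of
+the individual outputs.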
+\subsection{The BJKST Algorithm}
+We will now look at another algorithm for the distinct elements problem.
+Note that unlike the AMS algorithm, it accepts an accuracy parameter
+$\varepsilon$.
+\algo{DistinctElements} \algalias{BJKST}
+\algin the data stream $\alpha$, the accuracy $\varepsilon$.
+\:\em{Init}: Choose a random hash function $h : [n] \to [n]$ from a 2-independent
+ family.
+\:: $z \= 0$, $B \= \emptyset$.
+\:\em{Process}($x$): 
+\::If ${\tt tz}(h(x)) \geq z$: 
+\:::$B \= B \cup \{ (x, {\tt tz}(h(x))) \}$
+\:::While $\vert B \vert \geq c/\varepsilon^2$:
+\::::$z \= z + 1$.
+\::::Remove all $(a, b)$ from $B$ such that $b = {\tt tz}(h(a)) < z$.
+\algout $\hat{d} \= \vert B \vert \cdot 2^{z}$.
+\endalgo
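+A Python sketch of the procedure above might look as follows. As
+assumptions of this sketch, we use the hash function
+$h(x) = (ax + b) \bmod p$ with the prime $p = 2^{61} - 1$ as a stand-in for a
+2-independent family, we take ${\tt tz}(0) = 0$, and we set the default
+$c = 577$ only because the analysis asks for $c > 576$. The function names
+and the {\tt seed} parameter are ours.

```python
import random

PRIME = (1 << 61) - 1  # assumed to be larger than the universe size n


def tz(x):
    """Number of trailing zero bits of x (tz(0) is taken to be 0 here)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1


def bjkst_estimate(stream, eps, c=577, seed=0):
    """Keep the elements whose hash has at least z trailing zeros."""
    rng = random.Random(seed)
    a, b = rng.randrange(1, PRIME), rng.randrange(PRIME)
    z = 0
    bucket = {}  # the set B, stored as x -> tz(h(x))
    for x in stream:
        level = tz((a * x + b) % PRIME)
        if level >= z:
            bucket[x] = level
            # Raise z until B is below the threshold c / eps^2 again.
            while len(bucket) >= c / eps**2:
                z += 1
                bucket = {y: l for y, l in bucket.items() if l >= z}
    return len(bucket) * 2**z
```

+When the number of distinct elements stays below $c / \varepsilon^2$, the
+counter $z$ remains $0$ and the sketch returns the exact count.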
+\lemma{
+    For any $\varepsilon > 0$, the BJKST algorithm is an
+    $(\varepsilon, \delta)$-estimator for some constant $\delta$.
+}
+\proof
+We set up the random variables $X_{r, j}$ and $Y_r$ as before. Let $t$ denote
+the value of $z$ when the algorithm terminates; then $Y_t = \vert B \vert$,
+and our estimate is $\hat{d} = \vert B \vert \cdot 2^t = Y_t \cdot 2^t$.
+Note that if $t = 0$, the algorithm computes $d$ exactly (since we never remove
+any elements from $B$, and $\hat{d} = \vert B \vert$). For $t \geq 1$, we
+say that the algorithm \em{fails} iff
+$\vert Y_t \cdot 2^t - d \vert \geq \varepsilon d$. Rearranging, we have that the
+algorithm fails iff:
+$$ \left\vert Y_t - {d \over 2^t} \right\vert \geq {\varepsilon d \over 2^t} $$
+To bound the probability of this event, we will sum over all possible values
+$r \in [\log n]$ that $t$ can take. Note that for \em{small} values of $r$,
+a failure is unlikely when $t = r$, since the required deviation
+$\varepsilon d / 2^t$ is
+large. For \em{large} values of $r$, simply achieving $t = r$ is difficult.
+More formally, let $s$ be the unique integer such that:
+$$ {12 \over \varepsilon^2} \leq {d \over 2^s} < {24 \over \varepsilon^2}$$
+Then we have:
+$$ \Pr[{\rm fail}] = \sum_{r = 1}^{\log n}
+ \Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r}
+    \land t = r \right] $$
+After splitting the sum around $s$, we bound small and large values by different
+methods as described above to get:
+$$ \Pr[{\rm fail}] \leq \sum_{r = 1}^{s - 1} 
+ \Pr\left[ \left\vert Y_r - {d \over 2^r} \right\vert \geq {\varepsilon d \over 2^r} \right] +
+\sum_{r = s}^{\log n}
+    \Pr\left[t = r \right] $$
+Recall that $\E[Y_r] = d / 2^r$, so the terms in the first sum can be bounded
+using Chebyshev's inequality. The second sum is equal to the probability of
+the event $[t \geq s]$, that is, the event $Y_{s - 1} \geq c / \varepsilon^2$
+(since $z$ is only increased when $B$ becomes larger than this threshold).
+We will use Markov's inequality to bound the probability of this event.
+Putting it all together, we have:
+$$\eqalign{
+ \Pr[{\rm fail}] &\leq \sum_{r = 1}^{s - 1} 
+ {\Var[Y_r] \over (\varepsilon d / 2^r)^2}  + {\E[Y_{s - 1}] \over c / \varepsilon^2}
+ \leq \sum_{r = 1}^{s - 1}
+ {d / 2^r \over (\varepsilon d / 2^r)^2}  + {d / 2^{s - 1} \over c / \varepsilon^2}\cr
+ &= \sum_{r = 1}^{s - 1} {2^r \over \varepsilon^2 d} + {\varepsilon^2 d \over c2^{s - 1}}
+ \leq {2^{s} \over \varepsilon^2 d} + {\varepsilon^2 d \over c2^{s - 1}}
+}
+$$
+Recalling the definition of $s$, we have $2^s / d \leq \varepsilon^2 / 12$, and
+$d / 2^{s - 1} \leq 48 / \varepsilon^2$, and hence:
+$$ \Pr[{\rm fail}] \leq {1 \over 12} + {48 \over c} $$
+which is smaller than (say) $1 / 6$ for $c > 576$. Hence the algorithm is an
+$(\varepsilon, 1 / 6)$-estimator.
+\qed
+As before, we can run $\O(\log (1 / \delta))$ independent copies of the algorithm,
+and take the median of their estimates to reduce the probability of failure
+to $\delta$. The only thing remaining is to look at the space usage of the
+algorithm.
+The counter $z$ requires only $\O(\log \log n)$ bits, and $B$ has
+$\O(1 / \varepsilon^2)$ entries, each of which needs $\O( \log n )$ bits.
+Finally, the hash function $h$ needs $\O(\log n)$ bits, so the total space
+used is dominated by $B$, and the algorithm uses $\O(\log n / \varepsilon^2)$
+space. As before, if we use the median trick, the space used increases to
+$\O(\log (1 / \delta) \cdot \log n / \varepsilon^2)$.
+(TODO: include the version of this algorithm where we save space by storing
+$(g(a), {\tt tz}(h(a)))$ instead of $(a, {\tt tz}(h(a)))$ in $B$ for some
+hash function $g$ as an exercise?)
+\endchapter
--- a/tex/adsmac.tex
+++ b/tex/adsmac.tex
@@ -170,6 +170,7 @@
 \def\E{{\bb E}}
 \def\Pr{{\rm Pr}\mkern0.5mu}
 \def\Prsub#1{{\rm Pr}_{#1}}
+\def\Var{{\rm Var}\mkern0.5mu}
 % Vektory
 \def\t{{\bf t}}