diff --git a/fs-succinct/succinct.tex b/fs-succinct/succinct.tex
index b6847109fdf980e12f4023e174de928e9d46940a..4fea12c0af69ebaec086a987e3b2bef6c118b88b 100644
--- a/fs-succinct/succinct.tex
+++ b/fs-succinct/succinct.tex
@@ -2,6 +2,7 @@
 \input adsmac.tex
 \singlechapter{50}
 \fi
+\input tabto.tex
 
 \chapter[succinct]{Space-efficient data structures}
 
@@ -21,13 +22,15 @@ Let us denote $s(n)$ the number of bits needed to store a size-$n$ data structur
 The information-theoretical optimum is $OPT(n) := \lceil\log |X(n)|\rceil$
 (which is essentially the entropy of a uniform distribution over $X(n)$).
 
+\defn{The {\I redundancy} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
+
 Now we can define three classes of data structures based on their fine-grained
 space efficiency:
 
 \defn{A data structure is
 \tightlist{o}
-\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,
-\:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,
+\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm}i.e., $r(n) = \O(1)$,
+\:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,\tabto{7.6cm}i.e., $r(n) = {\rm o}(OPT(n))$,
 \:{\I compact} when $s(n) \le \O(OPT(n))$.
 \endlist
 }
@@ -47,6 +50,36 @@ fast operations on these space-efficient data structures.
 
 \section{Succinct representation of strings}
 
+Let us consider the problem of representing a length-$n$ string over the alphabet $[m]$,
+for example a string of base-10 digits. The following two naive approaches immediately
+come to mind:
+
+\list{(a)}
+\: Consider the whole string as one base-10 number and convert that number into binary.
+   This achieves the information-theoretically optimal size of $OPT(n) = \lceil n \log 10 \rceil
+   \approx 3.32n = \Theta(n+1)$. However, this representation does not support local decoding
+   or modification -- you must always decode and re-encode the whole string.
+\: Store the string digit-by-digit. This uses space $n \lceil \log 10 \rceil = 4n = OPT(n) + \Theta(n)$.
+   For a fixed alphabet size, this is not succinct, since a $\Theta(n)$ overhead is not ${\rm o}(OPT(n)) = {\rm o}(n + 1)$\foot{More
+   formally, if we consider $\Theta$ and ${\rm o}$ to be sets of functions, then $\Theta(n) \cap {\rm o}(n + 1) = \emptyset$.}.
+   However, we get constant-time local decoding and modification for free.
+\endlist
+
+We would like to get the best of both worlds -- achieve close-to-optimal space
+requirements while also supporting constant-time local decoding and modification.
+
+A simple solution that may work in practice is to encode the digits in groups
+(e.g., encode every 2 consecutive digits as one number from the range $[100]$ and
+convert that number to binary).
+
+With groups of size $k$, we get $$s(n) = \lceil n/k \rceil \lceil k \log 10
+\rceil \le (n/k + 1)(k \log 10 + 1) = \underbrace{n \log 10}_{OPT(n)} + n/k +
+\underbrace{k\log 10 + 1}_{\O(1)}.$$ Thus we see that with increasing $k$, the
+redundancy decreases, approaching the optimum but never quite reaching it. For any
+fixed $k$ it remains linear in $n$, so the scheme is not succinct. Also, with
+increasing $k$, the local access time grows. In practice, however, one could
+choose a good compromise value of $k$ and happily use such a scheme.
+
 \endchapter
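To make the space trade-off of the grouping scheme in the last hunk concrete, a short worked instance could accompany that paragraph. This is only a sketch: the choice of $n = 64$ digits and of the particular group sizes below is illustrative and not taken from the patch; the arithmetic uses $\log 10 \approx 3.32$ (binary logarithm), as in the text.

For instance, for $n = 64$ digits we have $OPT(64) = \lceil 64 \log 10 \rceil = 213$ bits,
while digit-by-digit storage takes $4 \cdot 64 = 256$ bits. Grouping gives
$$\eqalign{
k = 2:&\quad \lceil 64/2 \rceil \cdot \lceil 2 \log 10 \rceil = 32 \cdot 7 = 224 \hbox{ bits,}\cr
k = 3:&\quad \lceil 64/3 \rceil \cdot \lceil 3 \log 10 \rceil = 22 \cdot 10 = 220 \hbox{ bits,}\cr
k = 8:&\quad \lceil 64/8 \rceil \cdot \lceil 8 \log 10 \rceil = 8 \cdot 27 = 216 \hbox{ bits,}\cr
}$$
approaching, but never reaching, the 213-bit optimum (the two ceilings even make the
progression non-monotone: $k = 4$ gives $16 \cdot 14 = 224$ bits).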
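A minimal executable sketch of the same grouping idea, written in Python; the class name GroupedDigitString, the default $k = 2$ and the 64-digit test string are invented for the illustration and are not part of the patch. It packs each group of $k$ digits into $\lceil k \log 10 \rceil$ bits of a byte array, so a single digit can be read or overwritten by touching only its own group.

from math import ceil, log2

class GroupedDigitString:
    """Base-10 digits stored in groups of k, each group packed into ceil(k*log2(10)) bits."""

    def __init__(self, digits, k=2):
        self.k = k
        self.b = ceil(k * log2(10))          # bits per group, e.g. 7 bits for k = 2
        self.n = len(digits)
        ngroups = ceil(self.n / k)
        self.data = bytearray(ceil(ngroups * self.b / 8))
        for i, d in enumerate(digits):
            self.set(i, d)

    def _read_group(self, g):
        # Extract the b-bit value of group g from the bit-packed byte array.
        v = 0
        for t in range(self.b):
            bit = g * self.b + t
            v |= ((self.data[bit >> 3] >> (bit & 7)) & 1) << t
        return v

    def _write_group(self, g, v):
        # Store the b-bit value v as group g in the bit-packed byte array.
        for t in range(self.b):
            bit = g * self.b + t
            if (v >> t) & 1:
                self.data[bit >> 3] |= 1 << (bit & 7)
            else:
                self.data[bit >> 3] &= ~(1 << (bit & 7))

    def get(self, i):
        """Decode digit i by unpacking only its own group -- O(1) for a fixed k."""
        g, j = divmod(i, self.k)
        return (self._read_group(g) // 10 ** j) % 10

    def set(self, i, d):
        """Overwrite digit i by re-encoding only its own group -- O(1) for a fixed k."""
        g, j = divmod(i, self.k)
        v = self._read_group(g)
        old = (v // 10 ** j) % 10
        self._write_group(g, v + (d - old) * 10 ** j)

# 64 digits with k = 2 occupy 32 groups * 7 bits = 224 bits (28 bytes), compared to
# 4 * 64 = 256 bits digit-by-digit and OPT(64) = ceil(64 * log2(10)) = 213 bits.
s = GroupedDigitString([int(c) for c in "31415926535897932384626433832795"
                                        "02884197169399375105820974944592"], k=2)
assert s.get(0) == 3 and s.get(63) == 2
s.set(63, 7)
assert s.get(63) == 7 and s.get(62) == 9   # the neighbouring digit is untouched
print(len(s.data), "bytes")                # 28

Up to rounding the byte array to whole bytes, this stores exactly the $\lceil n/k \rceil \lceil k \log 10 \rceil$ bits from the bound derived in the patch, while every get and set decodes and re-encodes a single group only.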