Commit 6b3f2cd4 by Filip Stedronsky

### Succint: strings intro, naive encoding, practical encoding by groups

parent dbb01577
@@ -2,6 +2,7 @@
\input adsmac.tex
\singlechapter{50}
\fi
\input tabto.tex

\chapter[succinct]{Space-efficient data structures}

@@ -21,13 +22,15 @@ Let us denote $s(n)$ the number of bits needed to store a size-$n$ data structur
The information-theoretical optimum is $OPT(n) := \lceil\log |X(n)|\rceil$
(which is essentially the entropy of a uniform distribution over $X(n)$).

\defn{{\I Redundance} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}

Now we can define three classes of data structures based on their fine-grained space
efficiency:

\defn{A data structure is
\tightlist{o}
\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,
\:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,
\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm}i.e., $r(n) = O(1)$,
\:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,\tabto{7.6cm}i.e., $r(n) = {\rm o}(OPT(n))$,
\:{\I compact} when $s(n) \le \O(OPT(n))$.
\endlist
}

@@ -47,6 +50,36 @@ fast operations on these space-efficient data structures.

\section{Succinct representation of strings}

Let us consider the problem of representing a length-$n$ string over alphabet $[m]$,
for example a string of base-10 digits. The following two naive approaches immediately
come to mind:

\list{(a)}
\: Consider the whole string as one base-10 number and convert that number into binary.
This achieves the information-theoretically optimum size of
$OPT(n) = \lceil n \log 10 \rceil \approx 3.32n = \Theta(n+1)$. However, this representation
does not support local decoding and modification -- you must always decode and re-encode
the whole string.
\: Store the string digit-by-digit. This uses space
$n \lceil \log 10 \rceil = 4n = OPT(n) + \Theta(n)$. For a fixed alphabet size, this is
not succinct because $\Theta(n) > o(OPT(n)) = o(n + 1)$\foot{More formally, if we consider
$\O$ and $o$ to be sets of functions, $\Theta(n) \cap o(n + 1) = \emptyset$.}.
However, we get constant-time local decoding and modification for free.
\endlist

We would like to get the best of both worlds -- achieve close-to-optimum space
requirements while also supporting constant-time local decoding and modification.

A simple solution that may work in practice is to encode the digits in groups
(e.g. encode every two consecutive digits as one number from the range $[100]$ and
convert that number to binary). With groups of size $k$, we get
$$s(n) = \lceil n/k \rceil \lceil k \log 10 \rceil \le (n/k + 1)(k \log 10 + 1) = \underbrace{n \log 10}_{OPT(n)} + n/k + \underbrace{k\log 10 + 1}_{\O(1)}.$$
Thus we see that with increasing $k$, redundance goes down, approaching the optimum but
never quite reaching it. For a fixed $k$, it is still linear in $n$, and thus our scheme
is not succinct. Also, with increasing $k$, local access time goes up. In practice,
however, one could choose a good compromise value for $k$ and happily use such a scheme.

\endchapter
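The group encoding described in the diff can be tried out with a short Python sketch. It is only an illustration under stated assumptions, not code from the commit: the class name `GroupedDigits` and its methods are hypothetical, and a single Python integer stands in for the bit array, so updates here are not truly constant-time the way they would be on a real packed array.

```python
class GroupedDigits:
    """Decimal digits packed in groups of k, each group in ceil(k * log2(10)) bits.

    Hypothetical illustration; a Python int is used as the bit array for brevity.
    """

    def __init__(self, digits, k):
        self.n = len(digits)
        self.k = k
        # ceil(k * log2(10)) computed exactly: bits needed for the values 0 .. 10^k - 1.
        self.bits = (10 ** k - 1).bit_length()
        self.mask = (1 << self.bits) - 1
        self.store = 0
        # Pad the last group with zeros so that every group holds exactly k digits.
        padded = list(digits) + [0] * (-self.n % k)
        for g in range(len(padded) // k):
            value = 0
            for d in padded[g * k:(g + 1) * k]:
                value = value * 10 + d
            self.store |= value << (g * self.bits)

    def _group(self, g):
        # Extract the fixed-width binary field holding group g.
        return (self.store >> (g * self.bits)) & self.mask

    def get(self, i):
        # Local decoding: only the group containing digit i is touched.
        g, pos = divmod(i, self.k)
        return (self._group(g) // 10 ** (self.k - 1 - pos)) % 10

    def set(self, i, d):
        # Local modification: re-encode a single group and write it back.
        g, pos = divmod(i, self.k)
        weight = 10 ** (self.k - 1 - pos)
        value = self._group(g)
        value += (d - (value // weight) % 10) * weight
        shift = g * self.bits
        self.store = (self.store & ~(self.mask << shift)) | (value << shift)

    def size_bits(self):
        # s(n) = ceil(n / k) * ceil(k * log2(10))
        return -(-self.n // self.k) * self.bits
```

For the nine digits 3, 1, 4, 1, 5, 9, 2, 6, 5, `size_bits()` gives 36 for `k = 1` (the digit-by-digit scheme), 35 for `k = 2`, and 30 for `k = 3`. Each `get` or `set` decodes one group of `k` digits, which is where the growth of local access time with `k` comes from.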