Succinct: strings intro, naive encoding, practical encoding by groups

Commit 6b3f2cd4 authored 3 years ago by Filip Stedronsky
--- a/fs-succinct/succinct.tex
+++ b/fs-succinct/succinct.tex
@@ -2,6 +2,7 @@
 \input adsmac.tex
 \singlechapter{50}
 \fi
+\input tabto.tex

 \chapter[succinct]{Space-efficient data structures}

@@ -21,13 +22,15 @@ Let us denote $s(n)$ the number of bits needed to store a size-$n$ data structur
 The information-theoretical optimum is $OPT(n) := \lceil\log |X(n)|\rceil$
 (which is essentially the entropy of a uniform distribution over $X(n)$).

+\defn{{\I Redundance} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
+
 Now we can define three classes of data structures based on their fine-grained space
 efficiency:

 \defn{A data structure is
 \tightlist{o}
-\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,
-\:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,
+\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm}i.e., $r(n) = \O(1)$,
+\:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,\tabto{7.6cm}i.e., $r(n) = {\rm o}(OPT(n))$,
 \:{\I compact} when $s(n) \le \O(OPT(n))$.
 \endlist
 }
@@ -47,6 +50,36 @@ fast operations on these space-efficient data structures.

 \section{Succinct representation of strings}

+Let us consider the problem of representing a length-$n$ string over alphabet $[m]$,
+for example a string of base-10 digits. The following two naive approaches immediately
+come to mind:
+
+\list{(a)}
+\: Consider the whole string as one base-10 number and convert that number into binary.
+   This achieves the information-theoretically optimal size of $OPT(n) = \lceil n \log 10 \rceil
+   \approx 3.32n = \Theta(n)$. However, this representation does not support local decoding and
+   modification -- you must always decode and re-encode the whole string.
+\: Store the string digit-by-digit. This uses space $n \lceil \log 10 \rceil = 4n = OPT(n) + \Theta(n)$.
+   For a fixed alphabet size, this is not succinct because the redundance $\Theta(n)$ is not in $o(OPT(n)) = o(n)$\foot{More
+   formally, if we consider $\O$ and $o$ to be sets of functions, $\Theta(n) \cap o(n) = \emptyset$.}.
+   However, we get constant-time local decoding and modification for free.
+\endlist
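+
+For illustration (a small example of our own, not taken from any particular implementation),
+take the string 907 with $n = 3$. Approach (a) stores the number 907 as {\tt 1110001011},
+using $\lceil 3 \log 10 \rceil = 10$ bits, while approach (b) stores the digits separately as
+{\tt 1001 0000 0111}, using 12 bits. Changing the middle digit to 1 turns (a) into
+{\tt 1110010101}, where four of the bits differ, whereas in (b) only the middle
+4-bit block changes, giving {\tt 1001 0001 0111}.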
+
+We would like to get the best of both worlds -- achieve close-to-optimum space
+requirements while also supporting constant-time local decoding and modification.
+
+A simple solution that may work in practice is to encode the digits in groups
+(e.g., encode each two consecutive digits as one number from the range $[100]$ and
+convert that number to binary).
+
+With groups of size $k$, we get $$s(n) = \lceil n/k \rceil \lceil k \log 10
+\rceil \le (n/k + 1)(k \log 10 + 1)  = \underbrace{n \log 10}_{\le OPT(n)} + n/k +
+\underbrace{k\log 10 + 1}_{\O(1)}.$$ Thus we see that with increasing $k$, the bound on
+redundance goes down, approaching the optimum but never quite reaching it. For a
+fixed $k$, the redundance is still linear in $n$ and thus our scheme is not succinct. Also, with
+increasing $k$, local access time goes up. In practice, however, one could
+choose a good compromise value for $k$ and happily use such a scheme.
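+
+As a quick sanity check (our own arithmetic, for illustration only), consider the
+cost per digit $\lceil k \log 10 \rceil / k$ for the first few group sizes:
+$$\eqalign{
+k = 1: &\quad \lceil \log 10 \rceil = 4, \quad 4/1 = 4 \hbox{ bits per digit},\cr
+k = 2: &\quad \lceil 2 \log 10 \rceil = 7, \quad 7/2 = 3.5 \hbox{ bits per digit},\cr
+k = 3: &\quad \lceil 3 \log 10 \rceil = 10, \quad 10/3 \approx 3.33 \hbox{ bits per digit}.\cr
+}$$
+The optimum is $\log 10 \approx 3.32$ bits per digit, so already $k = 3$ comes close
+(since $10^3 = 1000 \le 1024 = 2^{10}$), although a small per-digit overhead remains.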
+


 \endchapter