Succinct: prefix-free encoding intro

Commit c6d1db62 authored 3 years ago by Filip Stedronsky
--- a/fs-succinct/succinct.tex
+++ b/fs-succinct/succinct.tex
@@ -22,14 +22,18 @@ Let us denote $s(n)$ the number of bits needed to store a size-$n$ data structur
 The information-theoretical optimum is $OPT(n) := \lceil\log |X(n)|\rceil$
 (which is essentially the entropy of a uniform distribution over $X(n)$).
-\defn{{\I Redundance} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
+Note: We will always ignore additive constants, so we will sometimes use the
+definition $OPT(n) := \log |X(n)|$ (without rounding, differs by at most one from
+the original definition) interchangeably.
+\defn{{\I Redundancy} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
 Now we can define three classes of data structures based on their fine-grained space
 efficiency:
 \defn{A data structure is
 \tightlist{o}
-\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm}i.e., $r(n) = O(1)$,
+\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm}i.e., $r(n) = \O(1)$,
 \:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,\tabto{7.6cm}i.e., $r(n) = {\rm o}(OPT(n))$,
 \:{\I compact} when $s(n) \le \O(OPT(n))$.
 \endlist
@@ -48,7 +52,7 @@ data structure.
 And of course, as with any data structure, we want to be able to perform reasonably
 fast operations on these space-efficient data structures.
-\section{Succinct representation of strings}
+\section{Representation of strings over an arbitrary alphabet}
 Let us consider the problem of representing a length-$n$ string over alphabet $[m]$,
 for example a string of base-10 digits. The following two naive approaches immediately
@@ -75,11 +79,43 @@ convert that number to binary).
 With groups of size $k$, we get $$s(n) = \lceil n/k \rceil \lceil k \log 10
 \rceil \le (n/k + 1)(k \log 10 + 1)  = \underbrace{n \log 10}_{OPT(n)} + n/k +
 \underbrace{k\log 10 + 1}_{\O(1)}.$$ Thus we see that with increasing $k$,
-redundance goes down, approaching the optimum but never quite reaching it. For a
+redundancy goes down, approaching the optimum but never quite reaching it. For a
 fixed $k$ it is still linear and thus our scheme is not succinct. Also, with
 increasing $k$, local access time goes up. In practice, however, one could
 choose a good compromise value for $k$ and happily use such a scheme.
+We will develop a succinct encoding scheme later in this chapter.
+\section{Intermezzo: Prefix-free encoding of bit strings}
+Let us forget about arbitrary alphabets for a moment and consider a different
+problem. We want to encode a binary string of arbitrary length in a way that
+allows the decoder to determine when the string ends (it can be followed by
+arbitrary other data). Furthermore, we want this to be a streaming encoding
+-- i.e., encode the string piece by piece while it is being read from the input.
+The length of the string is not known in advance -- it will only be determined
+when the input reaches its end.\foot{If the length were known in advance, we could
+simply store the length using any simple variable-size number encoding, followed by the
+string data itself. This would give us $\O(\log n)$ redundancy almost for free.}
+A trivial solution might be to split the string into $b$-bit blocks and encode
+each of them into a $(b+1)$-bit block with a simple padding scheme:
+\tightlist{o}
+\: For a complete block, output its $b$ data bits followed by a zero.
+\: For an incomplete final block, output its data bits, followed by a zero
+   and then as many ones as needed to reach $b+1$ bits.
+\: If the final block is complete (input length is divisible by $b$), we must
+   add an extra padding-only block (zero followed by $b$ ones) to signal the
+   end of the string.
+\endlist
+The redundancy of such an encoding is at most $n/b + b + 1$ (one bit per block,
+$b+1$ for the extra padding block). For a fixed $b$, this is $\Theta(n)$, so the
+scheme is not succinct.
+\subsection{SOLE (Short-Odd-Long-Even) Encoding}
+\section{Succinct representation of arbitrary-alphabet strings}
 \endchapter
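
The trivial $b$-bit block padding scheme added in this commit can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the commit: the function names `pad_encode` and `pad_decode` are hypothetical, and representing bit strings as Python strings of `'0'`/`'1'` characters is an assumption made for readability. The decoder relies on the observation that a complete block ends in a zero while the final block always ends in a one.

```python
from typing import Tuple


def pad_encode(bits: str, b: int) -> str:
    """Encode a '0'/'1' string into (b+1)-bit blocks (hypothetical helper)."""
    if b < 1:
        raise ValueError("block size b must be at least 1")
    out = []
    for i in range(0, len(bits), b):
        block = bits[i:i + b]
        if len(block) == b:
            # Complete block: b data bits followed by a zero.
            out.append(block + "0")
        else:
            # Incomplete final block: data bits, a zero, then ones up to b+1 bits.
            out.append(block + "0" + "1" * (b - len(block)))
    if len(bits) % b == 0:
        # Input length divisible by b: add a padding-only block (zero, then b ones).
        out.append("0" + "1" * b)
    return "".join(out)


def pad_decode(stream: str, b: int) -> Tuple[str, int]:
    """Decode one string from the front of `stream`.

    Returns the decoded bits and the number of stream bits consumed, so the
    caller can continue reading whatever data follows.
    """
    out = []
    pos = 0
    while True:
        block = stream[pos:pos + b + 1]
        if len(block) < b + 1:
            raise ValueError("stream ended before the final block")
        pos += b + 1
        if block[-1] == "0":
            # Complete block: the first b bits are data.
            out.append(block[:b])
        else:
            # Final block: strip the trailing ones, then the separating zero.
            out.append(block.rstrip("1")[:-1])
            return "".join(out), pos
```

For example, with $b = 3$ the input `10110` is encoded as `10101001` (blocks `1010` and `1001`), and `pad_decode("10101001" + "0011", 3)` returns `("10110", 8)`, leaving the trailing data untouched. The encoder only ever needs to buffer the current block, which is what makes the scheme streaming.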