diff --git a/fs-succinct/succinct.tex b/fs-succinct/succinct.tex
index 4fea12c0af69ebaec086a987e3b2bef6c118b88b..730f8ed9a57a3b0559c6ec49baee907b9be53e9c 100644
--- a/fs-succinct/succinct.tex
+++ b/fs-succinct/succinct.tex
@@ -22,14 +22,18 @@ Let us denote $s(n)$ the number of bits needed to store a size-$n$ data structur
 The information-theoretical optimum is $OPT(n) := \lceil\log |X(n)|\rceil$
 (which is essentially the entropy of a uniform distribution over $X(n)$).
 
-\defn{{\I Redundance} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
+Note: We will always ignore additive constants, so sometimes we will use the
+definition $OPT(n) := \log |X(n)|$ (without rounding, differs by at most one from
+the original definition) interchangeably.
+
+\defn{{\I Redundancy} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
 
 Now we can define three classes of data structures based on their
 fine-grained space efficiency:
 
 \defn{A data structure is
 \tightlist{o}
-\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm}i.e., $r(n) = O(1)$,
+\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm}i.e., $r(n) = \O(1)$,
 \:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,\tabto{7.6cm}i.e., $r(n) = {\rm o}(OPT(n))$,
 \:{\I compact} when $s(n) \le \O(OPT(n))$.
 \endlist
@@ -48,7 +52,7 @@ data structure.
 And of course, as with any data structure, we want to be able to perform reasonably
 fast operations on these space-efficient data structures.
 
-\section{Succinct representation of strings}
+\section{Representation of strings over an arbitrary alphabet}
 
 Let us consider the problem of representing a length-$n$ string over alphabet $[m]$,
 for example a string of base-10 digits. The following two naive approaches immediately
@@ -75,11 +79,43 @@ convert that number to binary).
 With groups of size $k$, we get
 $$s(n) = \lceil n/k \rceil \lceil k \log 10 \rceil \le (n/k + 1)(k \log 10 + 1) = \underbrace{n \log 10}_{OPT(n)} + n/k + \underbrace{k\log 10 + 1}_{\O(1)}.$$
 Thus we see that with increasing $k$,
-redundance goes down, approaching the optimum but never quite reaching it. For a
+redundancy goes down, approaching the optimum but never quite reaching it. For a
 fixed $k$ it is still linear and thus our scheme is not succinct. Also, with
 increasing $k$, local access time goes up. In practice, however, one could chose a
 good-compromise value for $k$ and happily use such a scheme.
+We will develop a succinct encoding scheme later in this chapter.
+
+\section{Intermezzo: Prefix-free encoding of bit strings}
+
+Let us forget about arbitrary alphabets for a moment and consider a different
+problem. We want to encode a binary string of arbitrary length in a way that
+allows the decoder to determine when the string ends (it can be followed by
+arbitrary other data). Furthermore, we want this to be a streaming encoding
+-- i.e., encode the string piece by piece while it is being read from the input.
+The length of the string is not known in advance -- it will only be determined
+when the input reaches its end.\foot{If the length were known in advance, we could
+simply store the length using any simple variable-size number encoding, followed by the
+string data itself. This would give us $\O(\log n)$ redundancy almost for free.}
+
+A trivial solution might be to split the string into $b$-bit blocks and encode
+each of them into a $(b+1)$-bit block with a simple padding scheme:
+\tightlist{o}
+\: For a complete block, output its $b$ data bits followed by a zero.
+\: For an incomplete final block, output its data bits, followed by a zero
+   and then as many ones as needed to reach $b+1$ bits.
+\: If the final block is complete (input length is divisible by $b$), we must
+   add an extra padding-only block (zero followed by $b$ ones) to signal the
+   end of the string.
+\endlist
+
+The redundancy of such an encoding is at most $n/b + b + 1$ (one bit per block,
+$b+1$ for the extra padding block). For a fixed $b$, this is $\Theta(n)$, so the
+scheme is not succinct.
+
+\subsection{SOLE (Short-Odd-Long-Even) Encoding}
+
+\section{Succinct representation of arbitrary-alphabet strings}
 
 \endchapter