Commit c6d1db62 authored by Filip Stedronsky

Succinct: prefix-free encoding intro

parent 6b3f2cd4
......@@ -22,14 +22,18 @@ Let us denote $s(n)$ the number of bits needed to store a size-$n$ data structur
The information-theoretical optimum is $OPT(n) := \lceil\log |X(n)|\rceil$
(which is essentially the entropy of a uniform distribution over $X(n)$).
\defn{{\I Redundance} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
Note: We will always ignore additive constants, so we will sometimes use the
definition $OPT(n) := \log |X(n)|$ (without rounding, differs by at most one from
the original definition) interchangeably.
\defn{{\I Redundancy} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
Now we can define three classes of data structures based on their fine-grained space
\defn{A data structure is
\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm}i.e., $r(n) = O(1)$,
\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm}i.e., $r(n) = \O(1)$,
\:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,\tabto{7.6cm}i.e., $r(n) = {\rm o}(OPT(n))$,
\:{\I compact} when $s(n) \le \O(OPT(n))$.
......@@ -48,7 +52,7 @@ data structure.
And of course, as with any data structure, we want to be able to perform reasonably
fast operations on these space-efficient data structures.
\section{Succinct representation of strings}
\section{Representation of strings over an arbitrary alphabet}
Let us consider the problem of representing a length-$n$ string over alphabet $[m]$,
for example a string of base-10 digits. The following two naive approaches immediately
......@@ -75,11 +79,43 @@ convert that number to binary).
With groups of size $k$, we get $$s(n) = \lceil n/k \rceil \lceil k \log 10
\rceil \le (n/k + 1)(k \log 10 + 1) = \underbrace{n \log 10}_{OPT(n)} + n/k +
\underbrace{k\log 10 + 1}_{\O(1)}.$$ Thus we see that with increasing $k$,
redundance goes down, approaching the optimum but never quite reaching it. For a
redundancy goes down, approaching the optimum but never quite reaching it. For a
fixed $k$ it is still linear in $n$ and thus our scheme is not succinct. Also, with
increasing $k$, local access time goes up. In practice, however, one could
choose a good compromise value for $k$ and happily use such a scheme.
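As a rough numeric illustration of the formula above (the values $n = 300$ and
$k \in \{1, 3\}$ are our own choice for the example, not part of the original
text), recall that $\log 10 \approx 3.3219$, so $OPT(300) = \lceil 300 \log 10
\rceil = 997$. Storing each digit separately ($k = 1$) gives
$$s(300) = 300 \cdot \lceil \log 10 \rceil = 300 \cdot 4 = 1200, \qquad r(300) = 1200 - 997 = 203,$$
while groups of three digits ($k = 3$) give
$$s(300) = \lceil 300/3 \rceil \cdot \lceil 3 \log 10 \rceil = 100 \cdot 10 = 1000, \qquad r(300) = 1000 - 997 = 3.$$
The redundancy for $k = 3$ is small but still grows linearly with $n$
(roughly $0.0114$ bits per digit).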
We will develop a succinct encoding scheme later in this chapter.
\section{Intermezzo: Prefix-free encoding of bit strings}
Let us forget about arbitrary alphabets for a moment and consider a different
problem. We want to encode a binary string of arbitrary length in a way that
allows the decoder to determine when the string ends (it can be followed by
arbitrary other data). Furthermore, we want this to be a streaming encoding
-- i.e., encode the string piece by piece while it is being read from the input.
The length of the string is not known in advance -- it will only be determined
when the input reaches its end.\foot{If the length were known in advance, we could
simply store the length using any simple variable-size number encoding, followed by the
string data itself. This would give us $\O(\log n)$ redundancy almost for free.}
A trivial solution might be to split the string into $b$-bit blocks and encode
each of them into a $(b+1)$-bit block with a simple padding scheme:
\: For a complete block, output its $b$ data bits followed by a zero.
\: For an incomplete final block, output its data bits, followed by a zero
and then as many ones as needed to reach $b+1$ bits.
\: If the final block is complete (input length is divisible by $b$), we must
add an extra padding-only block (zero followed by $b$ ones) to signal the
end of the string.
The redundancy of such an encoding is at most $n/b + b + 1$ (one bit per block,
plus $b+1$ bits for the extra padding block). For a fixed $b$, this is
$\Theta(n)$, so the scheme is not succinct.
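The block padding scheme described above can be sketched in a few lines of
Python. This is only an illustrative sketch under our own assumptions: bits are
modelled as a list of 0/1 integers, and the function names `encode_blocks` and
`decode_blocks` are hypothetical, not taken from the text. The decoder relies on
the observation that a complete block ends with a zero, whereas the final block
always ends with a one.

```python
def encode_blocks(bits, b):
    """Encode a bit sequence into (b+1)-bit blocks, reading the input one bit at a time."""
    assert b >= 1
    out = []
    block = []
    for bit in bits:
        block.append(bit)
        if len(block) == b:
            # Complete block: b data bits followed by a zero.
            out.extend(block)
            out.append(0)
            block = []
    # Final block: remaining data bits (possibly none), a zero,
    # then ones up to b+1 bits. If the input length is divisible by b,
    # this is the padding-only block (zero followed by b ones).
    out.extend(block)
    out.append(0)
    out.extend([1] * (b - len(block)))
    return out


def decode_blocks(stream, b):
    """Decode one string from the start of stream.

    Returns (data_bits, number_of_stream_bits_consumed).
    """
    data = []
    pos = 0
    while True:
        block = stream[pos:pos + b + 1]
        pos += b + 1
        if block[-1] == 0:
            # Complete block: the first b bits are data.
            data.extend(block[:b])
        else:
            # Final block: strip the trailing ones and the zero before them.
            end = len(block) - 1
            while block[end] == 1:
                end -= 1
            data.extend(block[:end])
            return data, pos
```

Since the decoder reports how many bits it consumed, the encoded string can be
followed by arbitrary other data, as required.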
\subsection{SOLE (Short-Odd-Long-Even) Encoding}
\section{Succinct representation of arbitrary-alphabet strings}