Commit c6d1db62 authored by Filip Stedronsky

Succinct: prefix-free encoding intro

parent 6b3f2cd4
@@ -22,14 +22,18 @@ Let us denote $s(n)$ the number of bits needed to store a size-$n$ data structur
The information-theoretical optimum is $OPT(n) := \lceil\log |X(n)|\rceil$
(which is essentially the entropy of a uniform distribution over $X(n)$).
Note: We will always ignore additive constants, so sometimes we will use the
definition $OPT(n) := \log |X(n)|$ (without rounding, differs by at most one from
the original definition) interchangeably.
\defn{{\I Redundancy} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
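As a quick illustration (our own example, not one used elsewhere in this chapter), let $X(n)$
be the set of permutations on $n$ elements, so $|X(n)| = n!$ and
$OPT(n) = \lceil\log n!\rceil = n\log n - \Theta(n)$. Storing the permutation as an array
of $n$ numbers with $\lceil\log n\rceil$ bits each gives $s(n) = n\lceil\log n\rceil$, hence
$$r(n) = n\lceil\log n\rceil - \lceil\log n!\rceil = \Theta(n).$$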
Now we can define three classes of data structures based on their fine-grained space
efficiency:
\defn{A data structure is
\tightlist{o}
\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm}i.e., $r(n) = \O(1)$,
\:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,\tabto{7.6cm}i.e., $r(n) = {\rm o}(OPT(n))$,
\:{\I compact} when $s(n) \le \O(OPT(n))$.
\endlist
@@ -48,7 +52,7 @@ data structure.
And of course, as with any data structure, we want to be able to perform reasonably
fast operations on these space-efficient data structures.
\section{Representation of strings over an arbitrary alphabet}
Let us consider the problem of representing a length-$n$ string over alphabet $[m]$,
for example a string of base-10 digits. The following two naive approaches immediately
@@ -75,11 +79,43 @@ convert that number to binary).
With groups of size $k$, we get $$s(n) = \lceil n/k \rceil \lceil k \log 10
\rceil \le (n/k + 1)(k \log 10 + 1) = \underbrace{n \log 10}_{OPT(n)} + n/k +
\underbrace{k\log 10 + 1}_{\O(1)}.$$ Thus we see that with increasing $k$,
the bound on redundancy goes down, approaching the optimum but never quite reaching it. For a
fixed $k$ it is still linear and thus our scheme is not succinct. Also, with
increasing $k$, local access time goes up. In practice, however, one could
choose a good-compromise value for $k$ and happily use such a scheme.
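The space formula for the grouping scheme can be checked numerically. The following is a minimal sketch, not part of the original notes; the function names {\tt bits\_for}, {\tt grouped\_size} and {\tt redundancy} are hypothetical, and the alphabet size $m$ defaults to 10 to match the running example.

```python
def bits_for(count: int) -> int:
    """Bits needed to distinguish `count` values, i.e. ceil(log2(count))."""
    return (count - 1).bit_length()


def grouped_size(n: int, k: int, m: int = 10) -> int:
    """s(n) for the grouping scheme: ceil(n/k) groups of ceil(k*log2(m)) bits."""
    groups = -(-n // k)  # ceil(n / k) in integer arithmetic
    return groups * bits_for(m ** k)


def redundancy(n: int, k: int, m: int = 10) -> int:
    """r(n) = s(n) - OPT(n), with OPT(n) = ceil(log2(m**n))."""
    return grouped_size(n, k, m) - bits_for(m ** n)


if __name__ == "__main__":
    n = 1000
    for k in (1, 2, 3, 6):
        print(k, grouped_size(n, k), redundancy(n, k))
```

For $n = 1000$ decimal digits, $k = 1$ uses 4000 bits while $k = 3$ uses 3340 bits against an optimum of 3322. The value $k = 3$ fits decimal digits particularly well because $10^3$ lies just below $2^{10}$.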
We will develop a succinct encoding scheme later in this chapter.
\section{Intermezzo: Prefix-free encoding of bit strings}
Let us forget about arbitrary alphabets for a moment and consider a different
problem. We want to encode a binary string of arbitrary length in a way that
allows the decoder to determine when the string ends (it can be followed by
arbitrary other data). Furthermore, we want this to be a streaming encoding
-- i.e., encode the string piece by piece while it is being read from the input.
The length of the string is not known in advance -- it will only be determined
when the input reaches its end.\foot{If the length were known in advance, we could
simply store the length using any simple variable-size number encoding, followed by the
string data itself. This would give us $\O(\log n)$ redundancy almost for free.}
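The known-length alternative mentioned in the footnote can be sketched as follows. The notes do not say which variable-size number encoding to use; Elias gamma coding is our assumption here, and all function names are hypothetical. Bit strings are modelled as Python strings of {\tt 0} and {\tt 1} for readability.

```python
def gamma_encode(x: int) -> str:
    """Elias gamma code of x >= 1: (len - 1) zeros, then x in binary."""
    if x < 1:
        raise ValueError("gamma code is defined for positive integers only")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def encode_with_length(bits: str) -> str:
    """Length prefix followed by the data; len + 1 keeps the empty string encodable."""
    return gamma_encode(len(bits) + 1) + bits


def decode_with_length(stream: str):
    """Return (decoded bits, number of stream bits consumed)."""
    zeros = 0
    while stream[zeros] == "0":  # count the unary part of the gamma code
        zeros += 1
    start = 2 * zeros + 1        # total length of the gamma-coded prefix
    n = int(stream[zeros:start], 2) - 1
    return stream[start:start + n], start + n
```

With this choice the prefix takes $2\lfloor\log(n+1)\rfloor + 1$ bits, which is $\O(\log n)$. The encoder needs $n$ before it can emit anything, so it is not a streaming encoding.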
A trivial solution might be to split the string into $b$-bit blocks and encode
each of them into a $(b+1)$-bit block with a simple padding scheme:
\tightlist{o}
\: For a complete block, output its $b$ data bits followed by a zero.
\: For an incomplete final block, output its data bits, followed by a zero
and then as many ones as needed to reach $b+1$ bits.
\: If the final block is complete (input length is divisible by $b$), we must
add an extra padding-only block (zero followed by $b$ ones) to signal the
end of the string.
\endlist
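The three rules above can be sketched in code as follows. This is an illustrative sketch with hypothetical function names; bit strings are modelled as Python strings of {\tt 0} and {\tt 1}, and the encoder takes the whole input at once for brevity, although it only ever looks at one block at a time.

```python
def encode_blocks(bits: str, b: int) -> str:
    """Encode `bits` into (b+1)-bit blocks using the padding scheme."""
    out = []
    for i in range(0, len(bits), b):
        block = bits[i:i + b]
        if len(block) == b:
            out.append(block + "0")  # complete block: data, then a zero
        else:
            # incomplete final block: data, a zero, then ones up to b+1 bits
            out.append(block + "0" + "1" * (b - len(block)))
    if len(bits) % b == 0:
        out.append("0" + "1" * b)    # extra padding-only block
    return "".join(out)


def decode_blocks(stream: str, b: int):
    """Return (decoded bits, number of stream bits consumed)."""
    out = []
    pos = 0
    while True:
        block = stream[pos:pos + b + 1]
        if len(block) < b + 1:
            raise ValueError("stream ended before the final block")
        pos += b + 1
        if block[-1] == "0":
            out.append(block[:-1])   # complete block, more blocks follow
        else:
            # final block: drop the trailing ones and the zero before them
            out.append(block.rstrip("1")[:-1])
            return "".join(out), pos
```

The decoder only inspects the last bit of each block: a zero means a complete block, a one means the final block. Note that the padding-only block is simply an incomplete block carrying zero data bits.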
The redundancy of such an encoding is at most $n/b + b + 1$ (one bit per block,
$b+1$ for the extra padding block). For a fixed $b$, this is $\Theta(n)$, so the
scheme is not succinct.
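A short derivation of this bound, measuring redundancy against the $n$ raw data bits:
write $n = qb + r$ with $0 \le r < b$. The encoder always emits exactly $q+1$ blocks of
$b+1$ bits each (the last one being either the incomplete final block or the padding-only
block), so the redundancy is
$$(q+1)(b+1) - n = q + b + 1 - r \le n/b + b + 1.$$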
\subsection{SOLE (Short-Odd-Long-Even) Encoding}
\section{Succinct representation of arbitrary-alphabet strings}
\endchapter