Commit 6b3f2cd4 authored by Filip Stedronsky

Succinct: strings intro, naive encoding, practical encoding by groups

parent dbb01577
@@ -2,6 +2,7 @@
\input adsmac.tex
\input tabto.tex
\chapter[succinct]{Space-efficient data structures}
......@@ -21,13 +22,15 @@ Let us denote $s(n)$ the number of bits needed to store a size-$n$ data structur
The information-theoretical optimum is $OPT(n) := \lceil\log |X(n)|\rceil$
(which is essentially the entropy of a uniform distribution over $X(n)$).
\defn{{\I Redundance} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
Now we can define three classes of data structures based on their fine-grained space efficiency:
\defn{A data structure is
\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm}i.e., $r(n) = \O(1)$,
\:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,\tabto{7.6cm}i.e., $r(n) = {\rm o}(OPT(n))$,
\:{\I compact} when $s(n) \le \O(OPT(n))$.
......@@ -47,6 +50,36 @@ fast operations on these space-efficient data structures.
\section{Succinct representation of strings}
Let us consider the problem of representing a length-$n$ string over alphabet $[m]$,
for example a string of base-10 digits. The following two naive approaches immediately
come to mind:
\: Consider the whole string as one base-10 number and convert that number into binary.
This achieves the information-theoretically optimal size of $OPT(n) = \lceil n \log 10 \rceil
\approx 3.32n = \Theta(n+1)$. However, this representation does not support local decoding and
modification -- you must always decode and re-encode the whole string.
\: Store the string digit-by-digit. This uses space $n \lceil \log 10 \rceil = 4n = OPT(n) + \Theta(n)$.
For a fixed alphabet size, this is not succinct because $\Theta(n) > o(OPT(n)) = o(n + 1)$\foot{More
formally, if we consider $\O$ and $o$ to be sets of functions, $\Theta(n) \cap o(n + 1) = \emptyset$.}.
However, we get constant-time local decoding and modification for free.
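The space figures of the two naive approaches above can be checked with a short Python sketch. The function names ({\tt whole\_bits}, {\tt digitwise\_bits}, {\tt encode\_whole}, {\tt decode\_whole}) are illustrative assumptions, not part of the text, and Python's arbitrary-precision integers stand in for the binary representation.

```python
def whole_bits(n):
    """Bits for the whole-number encoding: ceil(n * log2(10)), computed exactly.

    10**n - 1 is the largest n-digit value, and its bit length equals
    ceil(log2(10**n)) because 10**n is never a power of two for n >= 1.
    """
    return (10**n - 1).bit_length()


def digitwise_bits(n):
    """Bits for the digit-by-digit encoding: ceil(log2(10)) = 4 bits per digit."""
    return 4 * n


def encode_whole(digits):
    """Treat the whole string as one base-10 number (hypothetical helper)."""
    return int(digits)


def decode_whole(value, n):
    """Recover all n digits; there is no way to read one digit locally."""
    return str(value).zfill(n)


if __name__ == "__main__":
    s = "0031415926"
    assert decode_whole(encode_whole(s), len(s)) == s
    print(whole_bits(100), digitwise_bits(100))
```

For $n = 100$ the sketch reports 333 bits for the whole-number encoding against 400 bits for the digit-by-digit one, which matches $\lceil 100 \log 10 \rceil$ and $4n$ respectively.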
We would like to get the best of both worlds -- achieve close-to-optimum space
requirements while also supporting constant-time local decoding and modification.
A simple solution that may work in practice is to encode the digits in groups
(e.g. encode each 2 consecutive digits into one number from the range $[100]$ and
convert that number to binary).
With groups of size $k$, we get $$s(n) = \lceil n/k \rceil \lceil k \log 10
\rceil \le (n/k + 1)(k \log 10 + 1) = \underbrace{n \log 10}_{OPT(n)} + n/k +
\underbrace{k\log 10 + 1}_{\O(1)}.$$ Thus we see that with increasing $k$,
redundance goes down, approaching the optimum but never quite reaching it. For a
fixed $k$ the redundance is still linear in $n$ (because of the $n/k$ term) and
thus our scheme is not succinct. Also, with increasing $k$, local access time
goes up. In practice, however, one could choose a value of $k$ that is a good
compromise and happily use such a scheme.
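A minimal Python sketch of the grouped encoding follows, assuming base-10 digits and using one Python integer as the bit array. All function names are hypothetical, and the last group is padded with zeros as an assumption of this sketch. Shifts on Python big integers are not truly constant-time, so the code only illustrates that reading or changing a digit touches a single group.

```python
def group_width(k):
    """Bits per group of k digits: ceil(k * log2(10)), computed exactly."""
    return (10**k - 1).bit_length()


def total_bits(n, k):
    """s(n) = ceil(n/k) * ceil(k * log2(10))."""
    return -(-n // k) * group_width(k)


def encode(digits, k):
    """Pack the digit string into one integer, group g at bit offset g * width."""
    w = group_width(k)
    packed = 0
    for g, start in enumerate(range(0, len(digits), k)):
        chunk = digits[start:start + k].ljust(k, "0")  # pad the last group
        packed |= int(chunk) << (g * w)
    return packed


def digit_at(packed, k, i):
    """Decode digit i by reading only the group that contains it."""
    w = group_width(k)
    group = (packed >> ((i // k) * w)) & ((1 << w) - 1)
    return (group // 10 ** (k - 1 - (i % k))) % 10


def set_digit(packed, k, i, d):
    """Return a copy of packed with digit i replaced by d (one group rewritten)."""
    w = group_width(k)
    shift = (i // k) * w
    mask = (1 << w) - 1
    group = (packed >> shift) & mask
    p = 10 ** (k - 1 - (i % k))
    group += (d - (group // p) % 10) * p
    return (packed & ~(mask << shift)) | (group << shift)


if __name__ == "__main__":
    s = "31415926535"
    packed = encode(s, 3)
    assert "".join(str(digit_at(packed, 3, i)) for i in range(len(s))) == s
    for k in (1, 2, 3):
        print(k, total_bits(100, k))
```

For $n = 100$ this gives 400, 350 and 340 bits for $k = 1, 2, 3$, compared with the optimum of 333 bits, which illustrates how the redundance shrinks as $k$ grows.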