Commit b160f3eb authored by Filip Stedronsky

Succinct: SOLE intro

parent c6d1db62
......@@ -101,11 +101,11 @@ string data itself. This would give us $\O(\log n)$ redundancy almost for free.}
A trivial solution might be to split the string into $b$-bit blocks and encode
each of them into a $(b+1)$-bit block with a simple padding scheme:
\tightlist{o}
-\: For a complete block, output its $b$ data bits followed by a zero.
-\: For an incomplete final block, output its data bits, followed by a zero
-and then as many ones as needed to reach $b+1$ bits.
+\: For a complete block, output its $b$ data bits followed by a one.
+\: For an incomplete final block, output its data bits, followed by a one
+and then as many zeros as needed to reach $b+1$ bits.
\: If the final block is complete (input length is divisible by $b$), we must
-add an extra padding-only block (zero followed by $b$ ones) to signal the
+add an extra padding-only block (one followed by $b$ zeros) to signal the
end of the string.
\endlist
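The padding scheme above can be sketched in Python as follows. This is only an illustration, not part of the commit: it follows the newer wording of the list (a one after the data bits, then zeros), represents bit strings as Python strings of `0`/`1`, and the function names `encode_padded` and `decode_padded` are hypothetical.

```python
def encode_padded(bits, b):
    """Encode a bit string into (b+1)-bit blocks (illustrative sketch)."""
    out = []
    for i in range(0, len(bits), b):
        block = bits[i:i + b]
        if len(block) == b:
            # Complete block: b data bits followed by a one.
            out.append(block + "1")
        else:
            # Incomplete final block: data bits, a one, then zeros up to b+1 bits.
            out.append(block + "1" + "0" * (b - len(block)))
    if len(bits) % b == 0:
        # Final block was complete (or input empty): add a padding-only block.
        out.append("1" + "0" * b)
    return "".join(out)


def decode_padded(stream, b):
    """Decode one encoded string from the start of `stream`.

    Returns the decoded bits and the number of bits consumed, so trailing
    data after the encoded string is left untouched (prefix-free decoding).
    """
    data = []
    pos = 0
    while True:
        block = stream[pos:pos + b + 1]
        if len(block) < b + 1:
            raise ValueError("truncated input")
        pos += b + 1
        if block[-1] == "1":
            # Last bit is one: a complete data block, more blocks follow.
            data.append(block[:-1])
        else:
            # Last bit is zero: final block, strip the zeros and the marker one.
            data.append(block[:block.rindex("1")])
            return "".join(data), pos
```

A decoder can tell the final block from the others by its last bit alone, which is why every block costs one extra bit.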
......@@ -115,6 +115,48 @@ scheme is not succinct.
\subsection{SOLE (Short-Odd-Long-Even) Encoding}
In this section we will present a more advanced prefix-free string encoding
that will be succinct.
First, we split the input into $b$-bit blocks. We pad the last block with a
string of the form $10\cdots0$ to make it exactly $b$ bits long.
If the last block was already complete, we must add an extra padding-only
block so that the padding can be removed unambiguously.
Now we will consider each block as a single character from the alphabet $[B]$,
where $B:=2^b$. Then we shall extend this alphabet by adding a special EOF
character, which we append at the end of the string. This gives us
a new string over the alphabet $[B+1]$ that has length at most $n/b + 2$
($+1$ for padding, $+1$ for the added EOF character).
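As a rough illustration of the two steps above (padding, then viewing blocks as characters and appending EOF), here is a minimal Python sketch. The names `to_symbols` and `from_symbols` are hypothetical, and representing EOF by the value $B$ itself is an assumption made only for this example.

```python
def to_symbols(bits, b):
    """Turn a bit string into characters from [B+1], where B = 2**b."""
    B = 1 << b
    # Pad with 10...0 up to a multiple of b; a complete last block
    # automatically produces an extra padding-only block.
    padded = bits + "1"
    padded += "0" * (-len(padded) % b)
    symbols = [int(padded[i:i + b], 2) for i in range(0, len(padded), b)]
    symbols.append(B)  # assumption: EOF is represented by the value B
    return symbols


def from_symbols(symbols, b):
    """Inverse of to_symbols: drop EOF, then strip the 10...0 padding."""
    B = 1 << b
    if not symbols or symbols[-1] != B:
        raise ValueError("missing EOF character")
    padded = "".join(format(s, "0{}b".format(b)) for s in symbols[:-1])
    return padded[:padded.rindex("1")]
```

For an input of $n$ bits the result has $\lfloor n/b\rfloor + 2$ characters, which agrees with the bound stated above.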
However, as $B+1$ is not a power of two, it is no longer obvious how to
encode this string in binary. Note that this is a special case of the problem
stated above, i.e. encoding a string from an arbitrary alphabet. We will try to
solve this special case as a warm-up and then move on to a fully general solution.
First, we need to introduce a new concept: re-encoding character pairs into
different alphabets. Let's assume, for example, that we have two characters
$x\in[11]$ and $y\in[8]$. We can turn them into one character $z$ from
the alphabet [88] by the simple transformation $z = 8x + y$. We can then split
that character again into two in a different way, for example into two characters
from alphabets [9] and [10]. This can be accomplished by simple division with
remainder: if the combined character is $z\in [88]$, we transform it into
$\lfloor z / 10\rfloor$ and $(z \;{\rm mod}\; 10)$. For example, if we start
with the characters 6 and 5, they first get combined to form $6\cdot 8 + 5 = 53$
and then split into 5 and 3.
We can think of these two steps as a single transformation that takes
two characters from alphabets [11] and [8] and transforms them into
two characters from alphabets [9] and [10]. More generally, we can
always transform a pair of characters from alphabets $[A]$ and $[B]$
into a pair from alphabets $[C]$ and $[D]$ as long as $C\cdot D
\ge A \cdot B$ (we need an output universe large enough to hold all
possible input combinations).
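The pair re-encoding described above can be sketched in Python as follows. The function names `reencode_pair` and `decode_pair` are hypothetical and chosen only for this illustration; alphabets are taken to be $[A] = \{0,\ldots,A-1\}$, as in the text.

```python
def reencode_pair(x, y, A, B, C, D):
    """Map (x, y) from [A] x [B] to a pair from [C] x [D]."""
    if C * D < A * B:
        raise ValueError("output universe too small")
    if not (0 <= x < A and 0 <= y < B):
        raise ValueError("input characters out of range")
    z = x * B + y          # combine into one character from [A*B]
    return z // D, z % D   # split by division with remainder


def decode_pair(u, v, A, B, C, D):
    """Inverse of reencode_pair: recover (x, y) from (u, v)."""
    z = u * D + v
    return z // B, z % B


# Example from the text: alphabets [11] and [8] re-encoded into [9] and [10].
print(reencode_pair(6, 5, 11, 8, 9, 10))  # (5, 3)
```

The first output always fits into $[C]$: since $z \le AB - 1 \le CD - 1$, we get $\lfloor z/D\rfloor \le C - 1$.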
We will use this kind of pairwise alphabet re-encoding heavily in the SOLE
encoding. The best way to explain the exact scheme is with a diagram:
\section{Succinct representation of arbitrary-alphabet strings}
......