Succinct: SOLE intro

Commit b160f3eb authored 3 years ago by Filip Stedronsky
--- a/fs-succinct/succinct.tex
+++ b/fs-succinct/succinct.tex
@@ -101,11 +101,11 @@ string data itself. This would give us $\O(\log n)$ redundancy almost for free.}
 A trivial solution might be to split the string into $b$-bit blocks and encode
 each of them into a $(b+1)$-bit block with a simple padding scheme:
 \tightlist{o}
-\: For a complete block, output its $b$ data bits followed by a zero.
-\: For an incomplete final block, output its data bits, followed by a zero
-   and then as many ones as needed to reach $b+1$ bits.
+\: For a complete block, output its $b$ data bits followed by a one.
+\: For an incomplete final block, output its data bits, followed by a one
+   and then as many zeros as needed to reach $b+1$ bits.
 \: If the final block is complete (input length is divisible by $b$), we must
-   add an extra padding-only block (zero followed by $b$ ones) to signal the
+   add an extra padding-only block (one followed by $b$ zeros) to signal the
   end of the string.
 \endlist
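
The padding scheme in the hunk above (after the change: a one marks a complete block, a one followed by zeros marks the end) can be sketched in Python as follows. This is only an illustration, not part of the commit: the function names `encode_blocks` and `decode_blocks` and the use of `'0'`/`'1'` strings for bit strings are assumptions made for readability.

```python
def encode_blocks(bits: str, b: int) -> str:
    """Encode a '0'/'1' string into (b+1)-bit blocks (hypothetical helper)."""
    out = []
    full = len(bits) // b
    for i in range(full):
        # Complete block: b data bits followed by a one.
        out.append(bits[i * b:(i + 1) * b] + "1")
    tail = bits[full * b:]
    # Final block: remaining data bits (possibly none), a one, then zeros
    # up to b+1 bits. With an empty tail this is the padding-only block.
    out.append(tail + "1" + "0" * (b - len(tail)))
    return "".join(out)


def decode_blocks(stream: str, b: int):
    """Decode from the start of `stream`; return (data, bits consumed)."""
    data = []
    pos = 0
    while True:
        block = stream[pos:pos + b + 1]
        if len(block) != b + 1:
            raise ValueError("truncated input")
        pos += b + 1
        if block[-1] == "1":
            # Complete block: drop the marker bit and continue.
            data.append(block[:-1])
        else:
            # Final block: drop the trailing zeros and the last one.
            data.append(block[:block.rindex("1")])
            return "".join(data), pos
```

Because the decoder stops at the first block ending in a zero, it never reads past the end of the encoded string, which is what makes the code prefix-free. Every $b$ data bits cost one extra bit, which matches the diff's remark that the scheme is not succinct.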

@@ -115,6 +115,48 @@ scheme is not succinct.

 \subsection{SOLE (Short-Odd-Long-Even) Encoding}

+In this section we present a more advanced prefix-free string encoding
+that is succinct.
+
+First, we split the input into $b$-bit blocks. We pad the last block with
+$10\cdots0$ to make it $b$ bits long. If the last block is already complete,
+we must add an extra padding-only block to keep the padding scheme
+reversible.
+
+Now we consider each block as a single character from the alphabet $[B]$,
+where $B:=2^b$. We then extend this alphabet by a special EOF character,
+which we append at the end of the string. This gives us a new string over
+the alphabet $[B+1]$ that has length at most $n/b + 2$
+($+1$ for padding, $+1$ for the added EOF character).
+
+However, since $B+1$ is not a power of two, the question is how to
+encode this string. Note that this is a special case of the problem stated
+above, i.e. encoding a string from an arbitrary alphabet. We will try to solve
+this special case as a warm-up and then move on to a fully general solution.
+
+First, we need to introduce a new concept: re-encoding character pairs into
+different alphabets. Assume, for example, that we have two characters $x$ and
+$y$ from alphabets [11] and [8], respectively. We can turn them into one
+character from the alphabet [88] (by the simple transformation $8x + y$). We
+can then split that character again into two in a different way, for example
+into two characters from alphabets [9] and [10]. This can be accomplished by
+simple division with remainder: if the combined character is $z\in [88]$, we
+transform it into $\lfloor z / 10\rfloor$ and $(z \;{\rm mod}\; 10)$. For
+example, if we start with the characters 6 and 5, they first get combined to
+form $6\cdot 8 + 5 = 53$ and then split into 5 and 3.
+
+We can think of these two steps as a single transformation that takes
+two characters from alphabets [11] and [8] and transforms them into
+two characters from alphabets [9] and [10]. More generally, we can
+always transform a pair of characters from alphabets $[A]$ and $[B]$
+into a pair from alphabets $[C]$ and $[D]$ as long as $C\cdot D
+\ge A \cdot B$ (we need an output universe large enough to hold all
+possible input combinations).
+
+We will use this kind of pairwise alphabet re-encoding heavily in the SOLE
+encoding. The best way to explain the exact scheme is with a diagram:
+
+
 \section{Succinct representation of arbitrary-alphabet strings}
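
For reference, the pair re-encoding introduced in the SOLE subsection of this diff (combine two characters into one, then split by division with remainder) can be sketched in Python. This is a minimal sketch, not code from the commit: the names `reencode_pair` and `decode_pair` are hypothetical, and characters are assumed to be integers with $[A] = \{0,\ldots,A-1\}$.

```python
def reencode_pair(x: int, y: int, A: int, B: int, C: int, D: int):
    """Map (x, y) from alphabets [A], [B] to a pair from [C], [D]."""
    if C * D < A * B:
        raise ValueError("output universe too small: need C*D >= A*B")
    if not (0 <= x < A and 0 <= y < B):
        raise ValueError("input characters out of range")
    z = x * B + y          # combine into one character from [A*B]
    return z // D, z % D   # split by division with remainder


def decode_pair(u: int, v: int, A: int, B: int, C: int, D: int):
    """Invert reencode_pair: recover (x, y) from (u, v)."""
    z = u * D + v          # rebuild the combined character
    return z // B, z % B   # split it back into [A] and [B]


# The example from the text: alphabets [11], [8] -> [9], [10].
print(reencode_pair(6, 5, 11, 8, 9, 10))  # (5, 3)
print(decode_pair(5, 3, 11, 8, 9, 10))    # (6, 5)
```

The first output character always fits into $[C]$: since $z < A\cdot B \le C\cdot D$, the quotient $\lfloor z/D\rfloor$ is smaller than $C$.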