### Succinct: SOLE intro

parent c6d1db62
@@ -101,11 +101,11 @@ string data itself. This would give us $\O(\log n)$ redundancy almost for free.}
 A trivial solution might be to split the string into $b$-bit blocks and encode
 each of them into a $(b+1)$-bit block with a simple padding scheme:
 \tightlist{o}
-\: For a complete block, output its $b$ data bits followed by a zero.
-\: For an incomplete final block, output its data bits, followed by a zero
-   and then as many ones as needed to reach $b+1$ bits.
+\: For a complete block, output its $b$ data bits followed by a one.
+\: For an incomplete final block, output its data bits, followed by a one
+   and then as many zeros as needed to reach $b+1$ bits.
 \: If the final block is complete (input length is divisible by $b$), we must
-   add an extra padding-only block (zero followed by $b$ ones) to signal the
+   add an extra padding-only block (one followed by $b$ zeros) to signal the
    end of the string.
 \endlist
@@ -115,6 +115,48 @@ scheme is not succinct.
 \subsection{SOLE (Short-Odd-Long-Even) Encoding}
+
+In this section we present a more advanced prefix-free string encoding that
+is succinct.
+
+First, we split the input into $b$-bit blocks. We append padding of the form
+$10\cdots0$ to the last block to make it $b$ bits long. If the last block was
+already complete, we must add an extra padding-only block to keep the padding
+scheme reversible.
+
+Now we consider each block as a single character from the alphabet $[B]$,
+where $B:=2^b$. We extend this alphabet by a special EOF character, which we
+append after the last block. This gives us a new string over the alphabet
+$[B+1]$ of length at most $n/b + 2$ ($+1$ for padding, $+1$ for the added EOF
+character). However, as $B+1$ is not a power of two, the question is how to
+encode this string. Note that this is a special case of the problem stated
+above, i.e., encoding a string over an arbitrary alphabet. We will solve this
+special case as a warm-up and then move on to a fully general solution.
+
+First, we need to introduce a new concept: re-encoding character pairs into
+different alphabets. Assume, for example, that we have two characters $x$ and
+$y$ from alphabets $[11]$ and $[8]$, respectively. We can turn them into one
+character from the alphabet $[88]$ (by the simple transformation $8x + y$).
+We can then split that character into two again, in a different way: for
+example, into two characters from alphabets $[9]$ and $[10]$. This can be
+accomplished by simple division with remainder: if the combined character is
+$z\in[88]$, we transform it into $\lfloor z / 10\rfloor$ and
+$(z \;{\rm mod}\; 10)$. For example, if we start with the characters 6 and 5,
+they first get combined to form $6\cdot 8 + 5 = 53$ and then split into 5
+and 3.
+
+We can think of these two steps as a single transformation that takes two
+characters from alphabets $[11]$ and $[8]$ and transforms them into two
+characters from alphabets $[9]$ and $[10]$. More generally, we can always
+transform a pair of characters from alphabets $[A]$ and $[B]$ into a pair
+from alphabets $[C]$ and $[D]$ as long as $C\cdot D \ge A \cdot B$ (we need
+an output universe large enough to hold all possible input combinations).
+
+We will use this kind of pairwise alphabet re-encoding heavily in the SOLE
+encoding. The best way to explain the exact scheme is with a diagram:
+
 \section{Succinct representation of arbitrary-alphabet strings}
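
The two building blocks introduced in this hunk, padding the input into $b$-bit blocks with a trailing EOF character and re-encoding a pair of characters into different alphabets, can be sketched in Python as follows. This is an illustrative sketch, not part of the lecture notes: the function names are hypothetical, $[A]$ is assumed to mean $\{0,\ldots,A-1\}$, the EOF character is assumed to be represented by the value $B$, and the alphabet sizes 11, 8, 9 and 10 are picked only to be consistent with the worked example $6\cdot 8 + 5 = 53$.

```python
def pad_to_blocks(bits: str, b: int) -> list[int]:
    """Append 1 0...0 padding and cut the bit string into b-bit block values."""
    padded = bits + "1"
    # Fill with zeros up to the next multiple of b. If the input length was
    # already divisible by b, this produces an extra padding-only block.
    padded += "0" * (-len(padded) % b)
    return [int(padded[i:i + b], 2) for i in range(0, len(padded), b)]


def with_eof(blocks: list[int], b: int) -> list[int]:
    """Append an EOF character; assumption: EOF is the value B = 2**b."""
    return blocks + [1 << b]


def reencode_pair(x: int, y: int, A: int, B: int, C: int, D: int) -> tuple[int, int]:
    """Map (x, y) from [A] x [B] to a pair from [C] x [D]; needs C*D >= A*B."""
    assert 0 <= x < A and 0 <= y < B
    assert C * D >= A * B, "output universe too small"
    z = x * B + y          # combine into one character from [A*B]
    return divmod(z, D)    # split: (floor(z / D), z mod D)


def decode_pair(u: int, v: int, B: int, D: int) -> tuple[int, int]:
    """Invert reencode_pair: rebuild z, then split it by the original base B."""
    z = u * D + v
    return divmod(z, B)


if __name__ == "__main__":
    print(reencode_pair(6, 5, 11, 8, 9, 10))       # (5, 3)
    print(decode_pair(5, 3, 8, 10))                # (6, 5)
    print(with_eof(pad_to_blocks("1011", 4), 4))   # [11, 8, 16]
```

Decoding only needs the bases $B$ and $D$, because the combined value $z$ is the same on both sides. The condition $C\cdot D \ge A\cdot B$ is what guarantees that the first output character fits into $[C]$.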