diff --git a/fs-succinct/succinct.tex b/fs-succinct/succinct.tex
index 730f8ed9a57a3b0559c6ec49baee907b9be53e9c..f856197852e33d578e63e9602e2bfbadcfa7df2f 100644
--- a/fs-succinct/succinct.tex
+++ b/fs-succinct/succinct.tex
@@ -101,11 +101,11 @@ string data itself. This would give us $\O(\log n)$ redundancy almost for free.}
 A trivial solution might be to split the string into $b$-bit blocks and encode
 each of them into a $(b+1)$-bit block with a simple padding scheme:
 \tightlist{o}
-\: For a complete block, output its $b$ data bits followed by a zero.
-\: For an incomplete final block, output its data bits, followed by a zero
-  and then as many ones as needed to reach $b+1$ bits.
+\: For a complete block, output its $b$ data bits followed by a one.
+\: For an incomplete final block, output its data bits, followed by a one
+  and then as many zeros as needed to reach $b+1$ bits.
 \: If the final block is complete (input length is divisible by $b$), we must
-  add an extra padding-only block (zero followed by $b$ ones) to signal the
+  add an extra padding-only block (one followed by $b$ zeros) to signal the
   end of the string.
 \endlist
 
@@ -115,6 +115,48 @@ scheme is not succinct.
 
 \subsection{SOLE (Short-Odd-Long-Even) Encoding}
 
+In this section we will present a more advanced prefix-free string encoding
+that will be succinct.
+
+First, we split the input into $b$-bit blocks. We will add a padding in the
+form of $10\cdots0$ at the end of the last block to make it $b$ bits long.
+If the last block was complete, we must add an extra padding-only block to
+make the padding scheme reversible.
+
+Now we will consider each block as a single character from the alphabet $[B]$,
+where $B:=2^b$. Then we shall extend this alphabet by adding a special EOF
+character. We will add this character at the end of encoding. This gives us
+a new string from the alphabet $[B+1]$ that has length at most $n/b + 2$
+($+1$ for padding, $+1$ for added EOF character).
+
+However, as $B+1$ is not a power of two, now we have a question of how to
+encode this string. Note that this is a special case of the problem stated
+above, i.e. encoding a string from an arbitrary alphabet. We will try to solve
+this special case as a warm-up and then move on to a fully general solution.
+
+First, we need to introduce a new concept: re-encoding character pairs into
+different alphabets. Let's assume, for example, that we have two characters from
+alphabets [11] and [8], respectively. We can turn them into one character from
+the alphabet [88] (by the simple transformation $(x,y) \mapsto 8x + y$). We can then split
+that character again into two in a different way, for example into two characters
+from alphabets [9] and [10]. This can be accomplished by simple division with
+remainder: if the original character is $z\in [88]$, we transform it into
+$\lfloor z / 10\rfloor$ and $(z \;{\rm mod}\; 10)$. For example, if we start
+with the characters 6 and 5, they first get combined to form $6\cdot 8 + 5 = 53$
+and then split into 5 and 3.
+
+We can think of these two steps as a single transformation that takes
+two characters from alphabets [11] and [8] and transforms them into
+two characters from alphabets [9] and [10]. More generally, we can
+always transform a pair of characters from alphabets $[A]$ and $[B]$
+into a pair from alphabets $[C]$ and $[D]$ as long as $C\cdot D
+\ge A \cdot B$ (we need an output universe large enough to hold all
+possible input combinations).
+
+We will use this kind of alphabet re-encoding by pair heavily in the SOLE
+encoding. The best way to explain the exact scheme is with a diagram:
+
+
 \section{Succinct representation of arbitrary-alphabet strings}