datovky / ds2-notes, commit c6d1db62
Authored Aug 29, 2021 by Filip Stedronsky
Succinct: prefix-free encoding intro
Parent: 6b3f2cd4
fs-succinct/succinct.tex
...
...
@@ -22,14 +22,18 @@ Let us denote $s(n)$ the number of bits needed to store a size-$n$ data structur
The information-theoretical optimum is $OPT(n) := \lceil\log |X(n)|\rceil$
(which is essentially the entropy of a uniform distribution over $X(n)$).
-\defn{{\I Redundance} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
+Note: We will always ignore additive constants, so we will sometimes use the
+definition $OPT(n) := \log |X(n)|$ (without rounding; it differs by at most one from
+the original definition) interchangeably.
+
+\defn{{\I Redundancy} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
Now we can define three classes of data structures based on their fine-grained space
efficiency:
\defn{A data structure is
\tightlist{o}
-\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm} i.e., $r(n) = O(1)$,
+\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm} i.e., $r(n) = \O(1)$,
\:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,\tabto{7.6cm} i.e., $r(n) = {\rm o}(OPT(n))$,
\:{\I compact} when $s(n) \le \O(OPT(n))$.
\endlist
...
...
@@ -48,7 +52,7 @@ data structure.
And of course, as with any data structure, we want to be able to perform reasonably
fast operations on these space-efficient data structures.

-\section{Succinct representation of strings}
+\section{Representation of strings over arbitrary alphabet}

Let us consider the problem of representing a length-$n$ string over alphabet $[m]$,
for example a string of base-10 digits. The following two naive approaches immediately
...
...
@@ -75,11 +79,43 @@ convert that number to binary).
With groups of size $k$, we get
$$s(n) = \lceil n/k \rceil \lceil k \log 10 \rceil \le (n/k + 1)(k \log 10 + 1) = \underbrace{n \log 10}_{OPT(n)} + n/k + \underbrace{k\log 10 + 1}_{\O(1)}.$$
Thus we see that with increasing $k$,
-redundance goes down, approaching the optimum but never quite reaching it. For a
+redundancy goes down, approaching the optimum but never quite reaching it. For a
fixed $k$ it is still linear in $n$ and thus our scheme is not succinct. Also, with
increasing $k$, local access time goes up. In practice, however, one could
choose a good-compromise value for $k$ and happily use such a scheme.
We will develop a succinct encoding scheme later in this chapter.
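
The grouping scheme discussed above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the notes: the function names are hypothetical, bit strings are modelled as Python `str` objects of `'0'`/`'1'` characters, the last group is padded with zero digits, and the decoder is told the length $n$ separately.

```python
def group_width(k: int) -> int:
    """Bits per group of k decimal digits, i.e. ceil(k * log2(10))."""
    # 10**k is never a power of two for k >= 1, so this equals ceil(log2(10**k)).
    return (10**k - 1).bit_length()


def encode_grouped(digits: str, k: int) -> str:
    """Encode a decimal string as fixed-width binary fields, one per group of k digits."""
    width = group_width(k)
    groups = [digits[i:i + k] for i in range(0, len(digits), k)]
    # Assumption: an incomplete last group is padded with '0' digits on the right.
    return "".join(format(int(g.ljust(k, "0")), f"0{width}b") for g in groups)


def decode_grouped(bits: str, n: int, k: int) -> str:
    """Inverse of encode_grouped; n (the number of digits) must be known."""
    width = group_width(k)
    digits = "".join(
        format(int(bits[i:i + width], 2), f"0{k}d")
        for i in range(0, len(bits), width)
    )
    return digits[:n]  # drop the padding digits of the last group


def encoded_size(n: int, k: int) -> int:
    """s(n) = ceil(n/k) * ceil(k log 10), computed exactly with integers."""
    return -(-n // k) * group_width(k)
```

For instance, `encode_grouped("31415926", 3)` produces three 10-bit fields, and `encoded_size(3000, 1)` is noticeably larger than `encoded_size(3000, 3)`, matching the observation that the redundancy shrinks as $k$ grows while each local access has to decode a wider field.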
+\section{Intermezzo: Prefix-free encoding of bit strings}
+
+Let us forget about arbitrary alphabets for a moment and consider a different
+problem. We want to encode a binary string of arbitrary length in a way that
+allows the decoder to determine when the string ends (it can be followed by
+arbitrary other data). Furthermore, we want this to be a streaming encoding
+-- i.e., encode the string piece by piece while it is being read from the input.
+The length of the string is not known in advance -- it will only be determined
+when the input reaches its end.\foot{If the length were known in advance, we could
+simply store the length using any simple variable-size number encoding, followed by the
+string data itself. This would give us $\O(\log n)$ redundancy almost for free.}
+
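
To make the footnote concrete, here is a minimal sketch of a length-prefixed encoding for the case where the length is known in advance. The notes only say "any simple variable-size number encoding"; the choice of the Elias gamma code here, the function names, and the use of `str` for bit strings are all assumptions made for illustration.

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code of n >= 1: (len - 1) zeros, then n in binary."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def encode_with_length(data: str) -> str:
    """Prefix the bit string with its length; len + 1 is stored so that
    the empty string (length 0) is representable."""
    return gamma_encode(len(data) + 1) + data


def decode_with_length(stream: str) -> tuple[str, str]:
    """Return (decoded bit string, remaining stream)."""
    zeros = 0
    while stream[zeros] == "0":          # unary part: number of extra length bits
        zeros += 1
    n = int(stream[zeros:2 * zeros + 1], 2) - 1
    start = 2 * zeros + 1
    return stream[start:start + n], stream[start + n:]
```

The prefix costs $2\lfloor\log(n+1)\rfloor + 1$ bits, which is the $\O(\log n)$ redundancy mentioned in the footnote. The sketch is not a streaming encoder: it needs the whole input before it can emit the first bit.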
+A trivial solution might be to split the string into $b$-bit blocks and encode
+each of them into a $(b+1)$-bit block with a simple padding scheme:
+\tightlist{o}
+\: For a complete block, output its $b$ data bits followed by a zero.
+\: For an incomplete final block, output its data bits, followed by a zero
+and then as many ones as needed to reach $b+1$ bits.
+\: If the final block is complete (input length is divisible by $b$), we must
+add an extra padding-only block (zero followed by $b$ ones) to signal the
+end of the string.
+\endlist
+
+The redundancy of such an encoding is at most $n/b+b+1$ (one bit per block,
+$b+1$ for the extra padding block). For a fixed $b$, this is $\Theta(n)$, so the
+scheme is not succinct.
+
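
The block padding scheme described above can be sketched as follows. The function names and the `str` representation of bit strings are assumptions for illustration; the encoder takes the whole input for brevity, but it only ever looks at one block at a time, so it could be adapted to a stream.

```python
def pad_encode(bits: str, b: int) -> str:
    """Encode a bit string into (b+1)-bit blocks using the padding scheme."""
    out = []
    for i in range(0, len(bits), b):
        block = bits[i:i + b]
        if len(block) == b:
            out.append(block + "0")                            # complete block
        else:
            out.append(block + "0" + "1" * (b - len(block)))   # incomplete final block
    if len(bits) % b == 0:
        out.append("0" + "1" * b)                              # extra padding-only block
    return "".join(out)


def pad_decode(stream: str, b: int) -> tuple[str, str]:
    """Return (decoded bit string, remaining stream after the final block)."""
    data = []
    pos = 0
    while True:
        block = stream[pos:pos + b + 1]
        pos += b + 1
        if block[-1] == "0":
            data.append(block[:b])                 # complete block, more to come
        else:
            # Final block: the data is everything before the last zero.
            data.append(block[:block.rindex("0")])
            return "".join(data), stream[pos:]
```

The decoder recognizes the final block by its last bit: a complete block always ends in a zero, while an incomplete or padding-only block always ends in at least one one. With `b = 2`, the input `"1011"` becomes `"100110011"`, i.e. 5 extra bits for 4 data bits, which meets the $n/b+b+1$ bound exactly.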
+\subsection{SOLE (Short-Odd-Long-Even) Encoding}
+
+\section{Succinct representation of arbitrary-alphabet strings}
+
\endchapter