Commit c6d1db62, authored 3 years ago by Filip Stedronsky
Succinct: prefix-free encoding intro
Parent: 6b3f2cd4
fs-succinct/succinct.tex: 40 additions, 4 deletions
...
...
@@ -22,14 +22,18 @@ Let us denote $s(n)$ the number of bits needed to store a size-$n$ data structur
The information-theoretical optimum is $OPT(n) := \lceil\log |X(n)|\rceil$ (which is essentially the
entropy of a uniform distribution over $X(n)$).
-\defn{{\I Redundance} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
+Note: We will always ignore additive constants, so we will sometimes use the
+definition $OPT(n) := \log |X(n)|$ (without rounding; it differs by at most one from
+the original definition) interchangeably.
+\defn{{\I Redundancy} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}
Now we can define three classes of data structures based on their fine-grained space
efficiency:
\defn{A data structure is
\tightlist{o}
-\: {\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm}i.e., $r(n) = O(1)$,
+\: {\I implicit} when $s(n) \le OPT(n) + \O(1)$,\tabto{7.6cm}i.e., $r(n) = \O(1)$,
\: {\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,\tabto{7.6cm}i.e., $r(n) = {\rm o}(OPT(n))$,
\: {\I compact} when $s(n) \le \O(OPT(n))$.
\endlist
...
...
@@ -48,7 +52,7 @@ data structure.
And of course, as with any data structure, we want to be able to perform reasonably
fast operations on these space-efficient data structures.
-\section{Succinct representation of strings}
+\section{Representation of strings over arbitrary alphabet}
Let us consider the problem of representing a length-$n$ string over alphabet $[m]$,
for example a string of base-10 digits. The following two naive approaches immediately
...
...
@@ -75,11 +79,43 @@ convert that number to binary).
With groups of size $k$, we get
$$s(n) = \lceil n/k \rceil \lceil k \log 10 \rceil \le (n/k + 1)(k \log 10 + 1) = \underbrace{n \log 10}_{OPT(n)} + n/k + \underbrace{k \log 10 + 1}_{\O(1)}.$$
Thus we see that with increasing $k$,
-redundance goes down, approaching the optimum but never quite reaching it. For a
+redundancy goes down, approaching the optimum but never quite reaching it. For a
fixed $k$ it is still linear and thus our scheme is not succinct. Also, with
increasing $k$, local access time goes up. In practice, however, one could
choose a good-compromise value for $k$ and happily use such a scheme.
+We will develop a succinct encoding scheme later in this chapter.
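The grouping scheme and its space bound can be tried out with a small sketch. The function names (`grouped_size_bits`, `opt_bits`, `encode_grouped`, `decode_grouped`) are hypothetical and not part of the notes. The sketch assumes that the final group is padded with zero symbols and that the decoder is told the length $n$. It is one possible illustration, not the notes' own implementation.

```python
def opt_bits(n, m=10):
    """OPT(n) = ceil(n log m): bits needed to tell apart all m**n strings."""
    return (m ** n - 1).bit_length()


def grouped_size_bits(n, k, m=10):
    """s(n) = ceil(n/k) * ceil(k log m) for groups of k symbols."""
    width = (m ** k - 1).bit_length()  # bits per group, ceil(k log m)
    groups = -(-n // k)                # ceil(n / k)
    return groups * width


def encode_grouped(symbols, k, m=10):
    """Encode symbols from [m] in groups of k, each as a fixed-width binary number."""
    width = (m ** k - 1).bit_length()
    out = []
    for i in range(0, len(symbols), k):
        group = list(symbols[i:i + k])
        group += [0] * (k - len(group))  # assumption: pad the last group with zeros
        value = 0
        for s in group:
            value = value * m + s        # read the group as a base-m number
        out.append(format(value, "0{}b".format(width)))
    return "".join(out)


def decode_grouped(bits, n, k, m=10):
    """Inverse of encode_grouped; n is the original length (assumed known)."""
    width = (m ** k - 1).bit_length()
    symbols = []
    for i in range(0, len(bits), width):
        value = int(bits[i:i + width], 2)
        group = []
        for _ in range(k):
            value, s = divmod(value, m)
            group.append(s)
        symbols.extend(reversed(group))  # divmod yields the last symbol first
    return symbols[:n]                   # drop the padding symbols
```

For ten decimal digits and $k = 3$ this gives four 10-bit groups, i.e. 40 bits against an optimum of 34 bits. Accessing one symbol requires decoding its whole group, which is where the growing local access time comes from.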
+\section{Intermezzo: Prefix-free encoding of bit strings}
+Let us forget about arbitrary alphabets for a moment and consider a different
+problem. We want to encode a binary string of arbitrary length in a way that
+allows the decoder to determine when the string ends (it can be followed by
+arbitrary other data). Furthermore, we want this to be a streaming encoding --
+i.e., encode the string piece by piece while it is being read from the input.
+The length of the string is not known in advance -- it will only be determined
+when the input reaches its end.\foot{If the length were known in advance, we could
+simply store the length using any simple variable-size number encoding, followed by the
+string data itself. This would give us $\O(\log n)$ redundancy almost for free.}
+A trivial solution might be to split the string into $b$-bit blocks and encode
+each of them into a $(b+1)$-bit block with a simple padding scheme:
+\tightlist{o}
+\: For a complete block, output its $b$ data bits followed by a zero.
+\: For an incomplete final block, output its data bits, followed by a zero
+   and then as many ones as needed to reach $b+1$ bits.
+\: If the final block is complete (input length is divisible by $b$), we must
+   add an extra padding-only block (zero followed by $b$ ones) to signal the
+   end of the string.
+\endlist
+The redundancy of such an encoding is at most $n/b + b + 1$ (one bit per block,
+$b+1$ for the extra padding block). For a fixed $b$, this is $\Theta(n)$, so the
+scheme is not succinct.
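The padding scheme above can be sketched as follows. The names `encode_padded` and `decode_padded` are hypothetical, and bit strings are modelled as Python strings of `'0'` and `'1'` for readability. For brevity the encoder takes the whole input at once, although it only ever looks at one block at a time, so a streaming variant would follow the same logic. The decoder relies on the observation that every non-final block ends with a zero, while the final block ends with at least one one.

```python
from typing import Tuple


def encode_padded(bits: str, b: int) -> str:
    """Encode a bit string into (b+1)-bit blocks with an end-of-string marker."""
    out = []
    full = len(bits) - len(bits) % b  # number of bits covered by complete blocks
    for i in range(0, full, b):
        out.append(bits[i:i + b] + "0")  # complete block: b data bits, then a zero
    tail = bits[full:]                   # 0 to b-1 leftover data bits
    # Final block: leftover bits, a zero, then ones up to b+1 bits.
    # With an empty tail this is the padding-only block (zero followed by b ones).
    out.append(tail + "0" + "1" * (b - len(tail)))
    return "".join(out)


def decode_padded(stream: str, b: int) -> Tuple[str, str]:
    """Decode one string from the stream; return (data, unread rest of the stream)."""
    data = []
    pos = 0
    while True:
        block = stream[pos:pos + b + 1]
        if len(block) < b + 1:
            raise ValueError("stream ended before the final block")
        pos += b + 1
        if block[-1] == "0":
            data.append(block[:b])        # complete block, more blocks follow
        else:
            payload = block.rstrip("1")   # strip the padding ones
            data.append(payload[:-1])     # drop the marker zero
            return "".join(data), stream[pos:]
```

For example, with $b = 3$ the 5-bit input `10110` is encoded as the two blocks `1010` and `1001`, and the decoder stops after the second block, leaving any following data untouched.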
+\subsection{SOLE (Short-Odd-Long-Even) Encoding}
+\section{Succinct representation of arbitrary-alphabet strings}
\endchapter