Commit 6b3f2cd4 authored 3 years ago by Filip Stedronsky
Succint: strings intro, naive encoding, practical encoding by groups
parent dbb01577

1 changed file: fs-succinct/succinct.tex (+35 additions, −2 deletions)
@@ -2,6 +2,7 @@
 \input adsmac.tex
 \singlechapter{50}
 \fi
+\input tabto.tex
 \chapter[succinct]{Space-efficient data structures}
@@ -21,13 +22,15 @@ Let us denote $s(n)$ the number of bits needed to store a size-$n$ data structure.
 The information-theoretical optimum is $OPT(n) := \lceil\log |X(n)|\rceil$
 (which is essentially the entropy of a uniform distribution over $X(n)$).

 \defn{{\I Redundance} of a space-efficient data structure is $r(n) := s(n) - OPT(n)$.}

 Now we can define three classes of data structures based on their fine-grained space
 efficiency:

 \defn{A data structure is
 \tightlist{o}
-\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$,
-\:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$,
+\:{\I implicit} when $s(n) \le OPT(n) + \O(1)$, \tabto{7.6cm} i.e., $r(n) = \O(1)$,
+\:{\I succinct} when $s(n) \le OPT(n) + {\rm o}(OPT(n))$, \tabto{7.6cm} i.e., $r(n) = {\rm o}(OPT(n))$,
 \:{\I compact} when $s(n) \le \O(OPT(n))$.
 \endlist}
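To make the redundance definition concrete, here is a small sketch (a hypothetical example, not part of the notes; the names `opt` and `s_bcd` are mine): take $X(n)$ to be all length-$n$ decimal-digit strings, so $OPT(n) = \lceil n \log 10 \rceil$ bits, and compare it against a plain 4-bits-per-digit encoding.

```python
from math import ceil, log2

# Hypothetical X(n): all length-n strings of decimal digits,
# so |X(n)| = 10^n and OPT(n) = ceil(n * log2(10)).
def opt(n: int) -> int:
    return ceil(n * log2(10))

# s(n) for a straightforward 4-bits-per-digit (BCD-style) encoding.
def s_bcd(n: int) -> int:
    return 4 * n

# Redundance r(n) = s(n) - OPT(n) grows linearly in n, so this encoding
# is compact (s(n) = O(OPT(n))) but not succinct.
for n in (10, 100, 1000):
    print(n, opt(n), s_bcd(n) - opt(n))
```

Running this shows $r(n)$ growing proportionally to $n$, which is exactly why the digit-by-digit scheme only reaches the compact class.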
@@ -47,6 +50,36 @@ fast operations on these space-efficient data structures.
+\section{Succinct representation of strings}
+
+Let us consider the problem of representing a length-$n$ string over alphabet $[m]$,
+for example a string of base-10 digits. The following two naive approaches immediately
+come to mind:
+
+\list{(a)}
+\: Consider the whole string as one base-10 number and convert that number into binary.
+This achieves the information-theoretically optimal size of
+$OPT(n) = \lceil n \log 10 \rceil \approx 3.32n = \Theta(n+1)$. However, this
+representation does not support local decoding and modification -- you must always
+decode and re-encode the whole string.
+\: Store the string digit-by-digit. This uses space
+$n \lceil \log 10 \rceil = 4n = OPT(n) + \Theta(n)$. For a fixed alphabet size, this is
+not succinct because $\Theta(n) > o(OPT(n)) = o(n+1)$\foot{More formally, if we consider
+$\O$ and $o$ to be sets of functions, $\Theta(n) \cap o(n+1) = \emptyset$.}.
+However, we get constant-time local decoding and modification for free.
+\endlist
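The two naive approaches can be sketched as follows (a minimal illustration; the function names and the sample string are my own, and Python's arbitrary-precision integers stand in for the binary representation):

```python
from math import ceil, log2

def encode_whole(s: str) -> int:
    # Approach (a): treat the whole digit string as one base-10 number.
    # Its bit length is at most ceil(n * log2(10)), the information-theoretic
    # optimum (leading zeros would require storing n alongside).
    return int(s)

def encode_digitwise(s: str) -> list[int]:
    # Approach (b): 4 bits per digit -- wasteful (4n vs ~3.32n bits),
    # but any single digit can be read or changed in O(1).
    return [int(c) for c in s]

digits = "31415926535897932384"              # n = 20 digits
whole_bits = encode_whole(digits).bit_length()   # at most ceil(20 * log2 10) = 67
digitwise_bits = 4 * len(encode_digitwise(digits))  # exactly 4n = 80
print(whole_bits, digitwise_bits)
```

Changing one digit in approach (a) requires rebuilding the whole integer, while in approach (b) it is a single list assignment -- the trade-off the text describes.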
+We would like to get the best of both worlds -- to achieve close-to-optimum space
+requirements while also supporting constant-time local decoding and modification.
+A simple solution that may work in practice is to encode the digits in groups
+(e.g., encode every 2 subsequent digits into one number from the range $[100]$ and
+convert that number to binary). With groups of size $k$, we get
+$$ s(n) = \lceil n/k \rceil \cdot \lceil k \log 10 \rceil
+   \le (n/k + 1)(k \log 10 + 1)
+   = \underbrace{n \log 10}_{OPT(n)} + n/k + \underbrace{k \log 10 + 1}_{\O(1)}. $$
+Thus we see that with increasing $k$, redundance goes down, approaching the optimum
+but never quite reaching it. For a fixed $k$ it is still linear and thus our scheme
+is not succinct. Also, with increasing $k$, local access time goes up. In practice,
+however, one could choose a good-compromise value for $k$ and happily use such a scheme.
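The grouping scheme can be sketched like this for $k=2$ (a minimal, assumption-laden sketch; the class `GroupedDigits` and its methods are my own names, and each group is conceptually stored in $\lceil 2 \log 10 \rceil = 7$ bits):

```python
from math import ceil, log2

K = 2                          # group size k
BITS = ceil(K * log2(10))      # ceil(2 * log2 10) = 7 bits per group

class GroupedDigits:
    """Digit string stored as one 7-bit code per pair of digits (sketch)."""
    def __init__(self, s: str):
        s = s + "0" * (-len(s) % K)   # pad to a multiple of k
        self.n = len(s)
        self.groups = [int(s[i:i + K]) for i in range(0, len(s), K)]

    def get(self, i: int) -> int:
        # O(1) local decoding: only one group is touched.
        g = self.groups[i // K]
        return (g // 10 ** (K - 1 - i % K)) % 10

    def set(self, i: int, d: int) -> None:
        # O(1) local modification: rewrite one digit within its group.
        p = 10 ** (K - 1 - i % K)
        g = self.groups[i // K]
        self.groups[i // K] = g - ((g // p) % 10) * p + d * p

    def bits_used(self) -> int:
        return len(self.groups) * BITS   # ceil(n/k) * ceil(k log 10)

gd = GroupedDigits("2718281828")   # n = 10, so 5 groups * 7 bits = 35 bits
gd.set(1, 9)                       # change one digit without touching the rest
print(gd.get(1), gd.bits_used())   # compare 35 bits with OPT(10) = 34
```

Larger $k$ shrinks the per-group rounding loss (the $n/k$ term in the bound above) at the cost of wider group arithmetic, mirroring the trade-off described in the text.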
\endchapter