Commit badfe5b3, authored May 26, 2019 by Martin Mareš
Strings: Suffix array by doubling
parent afeba22c
08-string/string.tex: 47 additions, 1 deletion
@@ -191,6 +191,52 @@ So the total time spent in the while loops is also $\O(n)$.
\subsection{Construction of the suffix array by doubling}

There is a~simple algorithm which builds the suffix array in $\O(n\log n)$ time.
As before, $\alpha$~will denote the input string and $n$~its length. Suffixes will
be represented by their starting position: $\alpha_i$~denotes the suffix $\alpha[i:{}]$.

The algorithm works in $\O(\log n)$ passes, which sort suffixes by their first~$k$
characters, where $k=2^0,2^1,2^2,\ldots$. For simplicity, we will index passes
by~$k$.

\defn{For any two strings $\gamma$ and~$\delta$, we define comparison of prefixes
of length~$k$: $\gamma =_k \delta$ if $\gamma[{}:k] = \delta[{}:k]$,
$\gamma \le_k \delta$ if $\gamma[{}:k] \le \delta[{}:k]$.}

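As a small illustration of the definition (the strings are chosen here for the example only, they do not come from the text), take $\gamma = {\tt banana}$ and $\delta = {\tt bandit}$. The first three characters agree, while the fourth one decides the order:
$$
\gamma =_3 \delta \quad\hbox{because}\quad \gamma[{}:3] = \delta[{}:3] = {\tt ban},
\qquad
\gamma \le_4 \delta \quad\hbox{because}\quad \gamma[{}:4] = {\tt bana} \le {\tt band} = \delta[{}:4].
$$
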
The $k$-th pass will produce a~permutation~$S_k$ on suffix positions, which sorts
suffixes by~$\le_k$. We can easily compute the corresponding ranking array~$R_k$, but this time
we have to be careful to assign the same rank to suffixes which are equal by~$=_k$.
Formally, $R_k[i]$ is the number of suffixes~$\alpha_j$ such that $\alpha_j <_k \alpha_i$.

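The tie-aware ranking can be computed by a~single scan over the sorted order. The following Python sketch shows one way to do it, assuming the order and a~comparison key for each suffix are already available. The names `rank_array`, `order` and `keys` are hypothetical and do not appear in the text.

```python
def rank_array(order, keys):
    """order: suffix positions sorted by keys[i].
    Returns R with R[i] = number of positions whose key is strictly smaller
    than keys[i], so suffixes with equal keys share one rank."""
    rank = [0] * len(order)
    for pos in range(1, len(order)):
        cur, prev = order[pos], order[pos - 1]
        if keys[cur] == keys[prev]:
            rank[cur] = rank[prev]   # equal keys: copy the rank of the predecessor
        else:
            rank[cur] = pos          # exactly pos suffixes are strictly smaller
    return rank


alpha = "banana"
keys = list(alpha)                   # k = 1: compare by the first character only
order = sorted(range(len(alpha)), key=lambda i: keys[i])
print(rank_array(order, keys))       # [3, 0, 4, 0, 4, 0]
```

All three suffixes starting with `a` get rank~0, the one starting with `b` gets rank~3, and both suffixes starting with `n` get rank~4.
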
In the first pass, we sort suffixes by their first character. Since the alphabet
can be arbitrarily large, this might require a~general-purpose sorting algorithm,
so we reserve $\O(n\log n)$ time for this step. The same time obviously suffices
for construction of the ranking array.

In the $2k$-th pass, we get suffixes ordered by $\le_k$ and we want to sort them by $\le_{2k}$.
For any two suffixes $\alpha_i$ and~$\alpha_j$, the following holds by definition of lexicographic order:
$$
\alpha_i \le_{2k} \alpha_j \Longleftrightarrow
(\alpha_i <_k \alpha_j) \lor
\bigl( (\alpha_i =_k \alpha_j) \land (\alpha_{i+k} \le_k \alpha_{j+k}) \bigr).
$$
Using the ranking function~$R_k$, we can write this as lexicographic comparison
of pairs $(R_k[i], R_k[i+k])$ and $(R_k[j], R_k[j+k])$.
We can therefore assign one such pair to each suffix and sort suffixes by these
pairs. Since any two pairs can be compared in constant time, a~general-purpose
sorting algorithm sorts them in $\O(n\log n)$ time. Afterwards, the ranking array
can be constructed in linear time by scanning the sorted order.

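The passes described so far can be put together as follows. This is a~minimal Python sketch, not the author's implementation: the function names are hypothetical, and the handling of positions $i+k \ge n$ is an assumption made here (such a~suffix is empty, so its rank is replaced by $-1$, which sorts before every real rank).

```python
def rank_array(order, keys):
    """Ranks consistent with keys; suffixes with equal keys share one rank."""
    rank = [0] * len(order)
    for pos in range(1, len(order)):
        cur, prev = order[pos], order[pos - 1]
        rank[cur] = rank[prev] if keys[cur] == keys[prev] else pos
    return rank


def suffix_array_doubling(alpha):
    """Suffix array of alpha by doubling with a comparison sort in every pass."""
    n = len(alpha)
    keys = list(alpha)                              # pass k = 1: first character
    order = sorted(range(n), key=lambda i: keys[i])
    k = 1
    while k < n:
        rank = rank_array(order, keys)              # R_k from the previous pass
        # Pair (R_k[i], R_k[i+k]); -1 stands for the empty suffix (assumption).
        keys = [(rank[i], rank[i + k] if i + k < n else -1) for i in range(n)]
        order = sorted(range(n), key=lambda i: keys[i])   # now sorted by <=_{2k}
        k *= 2
    return order


print(suffix_array_doubling("banana"))              # [5, 3, 1, 0, 4, 2]
```

For `banana`, the result lists the suffixes `a`, `ana`, `anana`, `banana`, `na`, `nana` by their starting positions.
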
Overall, we have $\O(\log n)$ passes, each taking $\O(n\log n)$ time. The whole
algorithm therefore runs in $\O(n\log^2 n)$ time. In each pass, we need to store
only the input string~$\alpha$, the ranking array from the previous step, the suffix
array of the current step, and the encoded pairs. All this fits in $\O(n)$ space.

We can improve time complexity by using Bucketsort to sort the pairs. As the pairs
contain only numbers between 0 and~$n$, we can sort in two passes with $n$~buckets.
This takes $\O(n)$ time, so the whole algorithm runs in $\O(n\log n)$ time. Please
note that the first pass still remains $\O(n\log n)$, unless we can assume that the
alphabet is small enough to index buckets. Space complexity stays linear.

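The two-pass bucket sort of pairs might look as follows in Python. This is a~sketch under stated assumptions: the name `sort_by_pairs` is hypothetical, both components are assumed to be integers between 0 and~$n$ inclusive (so the sketch allocates $n+1$ buckets), and a~caller that marks the empty suffix by $-1$ would have to shift the second component by one first.

```python
def sort_by_pairs(pairs, n):
    """Return indices into pairs sorted lexicographically by the pairs.
    Both components must be integers in range(n + 1).
    Two stable bucket passes: second component first, then the first one."""
    order = list(range(len(pairs)))
    for component in (1, 0):                  # less significant component first
        buckets = [[] for _ in range(n + 1)]
        for i in order:                       # stable: keeps the previous order
            buckets[pairs[i][component]].append(i)
        order = [i for bucket in buckets for i in bucket]
    return order


pairs = [(3, 5), (1, 2), (4, 1), (1, 0), (4, 1), (0, 0)]
print(sort_by_pairs(pairs, 6))                # [5, 3, 1, 0, 2, 4]
```

Each pass touches every pair once and every bucket once, so the sort takes linear time. Because the passes are stable, equal pairs keep their relative order, which is what the subsequent linear-time ranking scan relies on.
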
\endchapter