Commit badfe5b3 authored by Martin Mareš

Strings: Suffix array by doubling

parent afeba22c
@@ -191,6 +191,52 @@ So the total time spent in the while loops is also $\O(n)$.
\subsection{Construction of the suffix array by doubling}
There is a~simple algorithm which builds the suffix array in $\O(n\log n)$ time.
As before, $\alpha$~will denote the input string and $n$~its length. Suffixes will
be represented by their starting position: $\alpha_i$~denotes the suffix $\alpha[i:{}]$.
The algorithm works in $\O(\log n)$ passes, which sort suffixes by their first~$k$
characters, where $k=2^0,2^1,2^2,\ldots$ For simplicity, we will index passes
by~$k$.
\defn{For any two strings $\gamma$ and~$\delta$, we define comparison of prefixes
of length~$k$:
$\gamma =_k \delta$ if $\gamma[{}:k] = \delta[{}:k]$,
$\gamma \le_k \delta$ if $\gamma[{}:k] \le \delta[{}:k]$, and
$\gamma <_k \delta$ if $\gamma[{}:k] < \delta[{}:k]$.
}
The $k$-th pass will produce a~permutation~$S_k$ on suffix positions, which sorts
suffixes by~$\le_k$. We can easily compute the corresponding ranking array~$R_k$, but this time
we have to be careful to assign the same rank to suffixes which are equal by~$=_k$.
Formally, $R_k[i]$ is the number of suffixes~$\alpha_j$ such that $\alpha_j <_k \alpha_i$.
In the first pass, we sort suffixes by their first character. Since the alphabet
can be arbitrarily large, this might require a~general-purpose sorting algorithm,
so we reserve $\O(n\log n)$ time for this step. The same time obviously suffices
for construction of the ranking array.
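
The first pass and the tie-aware ranking can be sketched in Python as follows. This is only an illustrative sketch under our own assumptions: the names `ranks_from_sorted` and `first_pass` are hypothetical, positions are indexed from 0, and Python's built-in `sorted` stands in for the general-purpose sorting algorithm.

```python
def ranks_from_sorted(order, key):
    """Given positions sorted by key, return R where R[i] is the number
    of positions j with key(j) < key(i). Equal keys share one rank."""
    rank = [0] * len(order)
    for pos, i in enumerate(order):
        if pos > 0 and key(i) == key(order[pos - 1]):
            # Same key as the predecessor: reuse its rank.
            rank[i] = rank[order[pos - 1]]
        else:
            # Exactly pos positions have a strictly smaller key.
            rank[i] = pos
    return rank


def first_pass(alpha):
    """Sort suffix positions by their first character and rank them."""
    order = sorted(range(len(alpha)), key=lambda i: alpha[i])
    rank = ranks_from_sorted(order, lambda i: alpha[i])
    return order, rank
```

For the string `banana`, `first_pass` returns the ranking `[3, 0, 4, 0, 4, 0]`: all three suffixes starting with `a` share rank 0, since no suffix precedes them.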
In the $2k$-th pass, we are given the suffixes ordered by $\le_k$ and we want to sort them by $\le_{2k}$.
For any two suffixes $\alpha_i$ and~$\alpha_j$, the following holds by definition of lexicographic order:
$$\alpha_i \le_{2k} \alpha_j \Longleftrightarrow
(\alpha_i <_k \alpha_j) \lor
\bigl((\alpha_i =_k \alpha_j) \land (\alpha_{i+k} \le_k \alpha_{j+k})\bigr).
$$
Using the ranking function~$R_k$, we can write this as lexicographic comparison
of pairs $(R_k[i], R_k[i+k])$ and $(R_k[j], R_k[j+k])$.
If $i+k\ge n$, then $\alpha_{i+k}$ is the empty suffix, which precedes all non-empty ones,
so its rank must be smaller than the rank of any non-empty suffix.
We can therefore assign one such pair to each suffix and sort suffixes by these
pairs. Since any two pairs can be compared in constant time, a~general-purpose
sorting algorithm sorts them in $\O(n\log n)$ time. Afterwards, the ranking array
can be constructed in linear time by scanning the sorted order.
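
As a~small worked example (ours, not part of the algorithm itself), consider $\alpha = {\tt banana}$
with positions indexed from~0 and the step from $k=1$ to $2k=2$. We write~$-$ for the rank of the
empty suffix and assume that it is smaller than all other ranks. Sorting by the first character gives
$R_1 = (3,0,4,0,4,0)$, so the pairs $(R_1[i], R_1[i+1])$ for $i=0,\ldots,5$ are
$$
(3,0),\quad (0,4),\quad (4,0),\quad (0,4),\quad (4,0),\quad (0,-).
$$
Sorting the pairs lexicographically yields for example $S_2 = (5,1,3,0,2,4)$, and scanning this order
gives $R_2 = (3,1,4,1,4,0)$. Positions 1 and~3 share rank~1, because both suffixes start with {\tt an}.
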
Overall, we have $\O(\log n)$ passes, each taking $\O(n\log n)$ time. The whole
algorithm therefore runs in $\O(n\log^2 n)$ time. In each pass, we need to store
only the input string~$\alpha$, the ranking array from the previous step, the suffix
array of the current step, and the encoded pairs. All this fits in $\O(n)$ space.
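
The $\O(n\log^2 n)$ variant described above might look as follows in Python. This is a minimal sketch under our own assumptions: the function names are hypothetical, positions are indexed from 0, the built-in `sorted` plays the role of the general-purpose sorting algorithm, and the rank of an empty suffix (position past the end) is encoded as -1 so that it precedes every real rank.

```python
def ranks_from_sorted(order, key):
    """R[i] = number of positions j with key(j) < key(i); ties share a rank."""
    rank = [0] * len(order)
    for pos, i in enumerate(order):
        if pos > 0 and key(i) == key(order[pos - 1]):
            rank[i] = rank[order[pos - 1]]
        else:
            rank[i] = pos
    return rank


def suffix_array_doubling(alpha):
    """Return the suffix array of alpha, sorting pairs with a comparison sort."""
    n = len(alpha)
    # First pass: sort by the first character (k = 1).
    order = sorted(range(n), key=lambda i: alpha[i])
    rank = ranks_from_sorted(order, lambda i: alpha[i])
    k = 1
    while k < n:
        def pair(i):
            # Assumption: -1 stands for the rank of the empty suffix.
            return (rank[i], rank[i + k] if i + k < n else -1)

        # Pass 2k: sort by pairs of ranks from pass k, then re-rank.
        order = sorted(range(n), key=pair)
        rank = ranks_from_sorted(order, pair)
        k *= 2
    # Once k >= n, suffixes are ordered by their full contents.
    return order
```

For `banana`, the function returns `[5, 3, 1, 0, 4, 2]`, which corresponds to the suffixes `a`, `ana`, `anana`, `banana`, `na`, `nana`.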
We can improve the time complexity by using bucket sort to sort the pairs. As the pairs
contain only numbers between 0 and~$n$, we can sort them in two stable rounds with $\O(n)$~buckets:
first by the second component, then by the first one.
This takes $\O(n)$ time, so the whole algorithm runs in $\O(n\log n)$ time. Please
note that the first pass still remains $\O(n\log n)$, unless we can assume that the
alphabet is small enough to index buckets. Space complexity stays linear.
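
The bucket-sorting variant can be sketched in Python like this. Again, this is only an illustration under our own assumptions: all names are hypothetical, positions are indexed from 0, the second component of each pair is shifted by one so that 0 can encode the empty suffix, and the first pass still uses the built-in comparison sort.

```python
def counting_sort(items, key, buckets):
    """Stable sort of items by key(x), where 0 <= key(x) < buckets."""
    start = [0] * (buckets + 1)
    for x in items:
        start[key(x) + 1] += 1
    for b in range(buckets):
        start[b + 1] += start[b]      # start[b] = number of keys below b
    out = [0] * len(items)
    for x in items:
        out[start[key(x)]] = x
        start[key(x)] += 1
    return out


def ranks_from_sorted(order, key):
    """R[i] = number of positions j with key(j) < key(i); ties share a rank."""
    rank = [0] * len(order)
    for pos, i in enumerate(order):
        if pos > 0 and key(i) == key(order[pos - 1]):
            rank[i] = rank[order[pos - 1]]
        else:
            rank[i] = pos
    return rank


def suffix_array_bucket(alpha):
    """Suffix array by doubling, with linear-time sorting in each later pass."""
    n = len(alpha)
    # First pass: general-purpose sort, as the alphabet may be large.
    order = sorted(range(n), key=lambda i: alpha[i])
    rank = ranks_from_sorted(order, lambda i: alpha[i])
    k = 1
    while k < n:
        first = lambda i: rank[i]
        # Shifted by one; 0 is reserved for the empty suffix (our convention).
        second = lambda i: rank[i + k] + 1 if i + k < n else 0
        # Two stable rounds: less significant component first.
        order = counting_sort(range(n), second, n + 1)
        order = counting_sort(order, first, n)
        rank = ranks_from_sorted(order, lambda i: (first(i), second(i)))
        k *= 2
    return order
```

Stability of the second round is what preserves the order established by the first round among pairs with equal first components.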
\endchapter