Strings: Suffix array by doubling

Commit badfe5b3 authored May 26, 2019 by Martin Mareš
--- a/08-string/string.tex
+++ b/08-string/string.tex
@@ -191,6 +191,52 @@ So the total time spent in the while loops is also $\O(n)$.
 \subsection{Construction of the suffix array by doubling}
-TODO
+There is a~simple algorithm which builds the suffix array in $\O(n\log n)$ time.
+As before, $\alpha$~will denote the input string and $n$~its length. Suffixes will
+be represented by their starting position: $\alpha_i$~denotes the suffix $\alpha[i:{}]$.
+The algorithm works in $\O(\log n)$ passes, which sort suffixes by their first~$k$
+characters, where $k=2^0,2^1,2^2,\ldots$ For simplicity, we will index passes
+by~$k$.
+\defn{For any two strings $\gamma$ and~$\delta$, we define comparison of prefixes
+of length~$k$:
+$\gamma =_k \delta$ if $\gamma[{}:k] = \delta[{}:k]$,
+$\gamma \le_k \delta$ if $\gamma[{}:k] \le \delta[{}:k]$ (and similarly for $<_k$).
+}
+The $k$-th pass will produce a~permutation~$S_k$ on suffix positions, which sorts
+suffixes by~$\le_k$. We can easily compute the corresponding ranking array~$R_k$, but this time
+we have to be careful to assign the same rank to suffixes which are equal by~$=_k$.
+Formally, $R_k[i]$ is the number of suffixes~$\alpha_j$ such that $\alpha_j <_k \alpha_i$.
+In the first pass, we sort suffixes by their first character. Since the alphabet
+can be arbitrarily large, this might require a~general-purpose sorting algorithm,
+so we reserve $\O(n\log n)$ time for this step. The same time obviously suffices
+for construction of the ranking array.
+In the $2k$-th pass, we are given suffixes ordered by $\le_k$ and we want to sort them by $\le_{2k}$.
+For any two suffixes $\alpha_i$ and~$\alpha_j$, the following holds by definition of lexicographic order:
+$$\alpha_i \le_{2k} \alpha_j \Longleftrightarrow
+(\alpha_i <_k \alpha_j) \lor
+\bigl((\alpha_i =_k \alpha_j) \land (\alpha_{i+k} \le_k \alpha_{j+k})\bigr).
+$$
+Using the ranking function~$R_k$, we can write this as lexicographic comparison
+of pairs $(R_k[i], R_k[i+k])$ and $(R_k[j], R_k[j+k])$.
+We can therefore assign one such pair to each suffix and sort suffixes by these
+pairs. Since any two pairs can be compared in constant time, a~general-purpose
+sorting algorithm sorts them in $\O(n\log n)$ time. Afterwards, the ranking array
+can be constructed in linear time by scanning the sorted order.
+Overall, we have $\O(\log n)$ passes, each taking $\O(n\log n)$ time. The whole
+algorithm therefore runs in $\O(n\log^2 n)$ time. In each pass, we need to store
+only the input string~$\alpha$, the ranking array from the previous step, the suffix
+array of the current step, and the encoded pairs. All this fits in $\O(n)$ space.
+We can improve time complexity by using Bucketsort to sort the pairs. As the pairs
+contain only numbers between 0 and~$n$, we can sort in two passes with $n$~buckets.
+This takes $\O(n)$ time per pass, so the whole algorithm runs in $\O(n\log n)$ time. Note
+that the first pass still takes $\O(n\log n)$ time, unless we can assume that the
+alphabet is small enough to index buckets. Space complexity stays linear.
 \endchapter
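
The doubling construction described in this hunk can be sketched in Python as follows. This is a minimal illustration and not part of the commit. It assumes 0-based positions, and it assumes that the empty suffix at position n is included among the sorted suffixes, so a position i+k beyond the end is treated as the empty suffix. The names `suffix_array_doubling`, `ranks_from_sorted`, `bucket_sort_by` and the flag `use_buckets` are hypothetical.

```python
def ranks_from_sorted(order, key):
    """rank[i] = number of positions whose key is strictly smaller than key(i).

    `order` must already be sorted by `key`; equal keys share a rank.
    """
    rank = [0] * len(order)
    for t in range(1, len(order)):
        prev, cur = order[t - 1], order[t]
        rank[cur] = rank[prev] if key(cur) == key(prev) else t
    return rank


def bucket_sort_by(order, digit, n_buckets):
    """Stable bucket sort of `order` by digit(i), where 0 <= digit(i) < n_buckets."""
    buckets = [[] for _ in range(n_buckets)]
    for i in order:
        buckets[digit(i)].append(i)
    return [i for bucket in buckets for i in bucket]


def suffix_array_doubling(s, use_buckets=False):
    """Suffix array of s over positions 0..n (assumption: empty suffix included)."""
    n = len(s)

    # First pass (k = 1): sort by the first character with a general-purpose sort.
    # s[i:i+1] is '' for i = n, which compares smaller than any character.
    first_char = lambda i: s[i:i + 1]
    S = sorted(range(n + 1), key=first_char)
    R = ranks_from_sorted(S, first_char)

    k = 1
    while k < n:
        # Pair (R_k[i], R_k[i+k]); positions past the end act as the empty suffix.
        first = lambda i: R[i]
        second = lambda i: R[min(i + k, n)]
        pair = lambda i: (first(i), second(i))
        if use_buckets:
            # Two stable passes: by the second component, then by the first.
            S = bucket_sort_by(S, second, n + 1)
            S = bucket_sort_by(S, first, n + 1)
        else:
            S = sorted(S, key=pair)
        # The new ranks are computed from the old R before R is replaced.
        R = ranks_from_sorted(S, pair)
        k *= 2
    return S
```

With `use_buckets=False`, each pass uses a comparison sort, which corresponds to the $\O(n\log^2 n)$ variant. With `use_buckets=True`, each doubling pass takes linear time, while the first pass still relies on a general-purpose sort.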