Skip to content
Snippets Groups Projects
Commit afeba22c authored by Martin Mareš's avatar Martin Mareš
Browse files

Strings: Here we begin...

parent f3ff9446
No related branches found
No related tags found
No related merge requests found
...@@ -37,7 +37,7 @@ query. ...@@ -37,7 +37,7 @@ query.
In this chapter, we will consider the simple case of range queries In this chapter, we will consider the simple case of range queries
on points in~$\R^d$ on points in~$\R^d$
\section{Range queries in one dimension} \section[range1d]{Range queries in one dimension}
We show the basic techniques on one-dimensional range queries. The simplest We show the basic techniques on one-dimensional range queries. The simplest
solution is to use a~sorted array. Whenever we are given a~query interval $[a,b]$, solution is to use a~sorted array. Whenever we are given a~query interval $[a,b]$,
... ...
......
TOP=..
include ../Makerules
\ifx\chapter\undefined
\input adsmac.tex
\singlechapter{8}
\fi
\chapter[string]{Strings}
\nota{
\tightlist{o}
\:$\Sigma$ is an \em{alphabet} -- a~finite set of \em{characters}.
\:$\Sigma^*$ is the set of all \em{strings} (finite sequences) over~$\Sigma$.
\:We will use Greek letters for string variables, Latin letters for character and numeric variables,
and {\tt typewriter} letters for concrete characters. We will make no difference between
a~character and a~single-character string.
\:$|\alpha|$ is the \em{length} of the string~$\alpha$.
\:$\varepsilon$ is the \em{empty string} --- the only string of length~0.
\:$\alpha\beta$ is the \em{concatenation} of strings $\alpha$ and~$\beta$;
we have $\alpha\varepsilon = \varepsilon\alpha = \alpha$ for all~$\alpha$.
\:$\alpha[i]$ is the $i$-th character of the string~$\alpha$; characters
are indexed starting with~0.
\:$\alpha[i:j]$ is the \em{substring} $\alpha[i]\alpha[i+1]\ldots\alpha[j-1]$;
note that $\alpha[j]$ is the first character \em{behind} the substring,
so we have $|\alpha[i:j]| = j-i$. If $i>j$, the substring is empty.
Either $i$ or~$j$ can be omitted, the beginning or the end of~$\alpha$
is used instead.
\:$\alpha[{}:j]$ is the \em{prefix} of~$\alpha$ formed by the first~$j$ characters.
A~word of length~$n$ has $n+1$ prefixes, one of them being the empty string.
\:$\alpha[i:{}]$ is the \em{suffix} of~$\alpha$ from character number~$i$ to the end.
A~word of length~$n$ has $n+1$ suffixes, one of them being the empty string.
\:$\alpha[{}:{}] = \alpha$.
\:$\alpha \le \beta$ denotes \em{lexicographic order} of strings:
$\alpha \le \beta$ if $\alpha$ is a~prefix of~$\beta$ or if there exists~$k$
such that $\alpha[k] < \beta[k]$ and $\alpha[{}:k] = \beta[{}:k]$.
\endlist
}
\section{Suffix arrays}
\texfigure{
\vbox{\halign{~~\hfil#~~&\hfil#~~&\hfil#~~&\hfil#~~~~&\tt#\hfil~~\cr
$i$ & $S[i]$ & $R[i]$ & $L[i]$ & \omit\em{suffix}\hfil \cr
\noalign{\smallskip\hrule\smallskip\smallskip}
0 & 14 & 3 & 0 & $\varepsilon$ \cr
1 & 8 & 11 & 3 & ananas \cr
2 & 10 & 10 & 2 & anas \cr
3 & 0 & 7 & 2 & annbansbananas \cr
4 & 4 & 4 & 1 & ansbananas \cr
5 & 12 & 12 & 0 & as \cr
6 & 7 & 14 & 3 & bananas \cr
7 & 3 & 6 & 0 & bansbananas \cr
8 & 9 & 1 & 2 & nanas \cr
9 & 11 & 8 & 1 & nas \cr
10 & 2 & 2 & 1 & nbansbananas \cr
11 & 1 & 9 & 1 & nnbansbananas \cr
12 & 5 & 5 & 0 & nsbananas \cr
13 & 13 & 13 & 1 & s \cr
14 & 6 & 0 & 0 & sbananas \cr
}}
}{Suffixes of \|annbansbananas| and the arrays $S$, $R$, and~$L$}
\defn{The \df{suffix array} for a~string~$\alpha$ of length~$n$ is a~permutation~$S$
of the set $\{0,\ldots,n\}$ such that $\alpha[S[i]:{}] < \alpha[S[i+1]:{}]$ for all
$0\le i< n$.
}
\claim{The suffix array can be constructed in time $\O(n)$.}
Once we have the suffix array for a~string~$\alpha$, we can easily locate all occurrences
of a~given substring~$\beta$ in~$\alpha$. Each occurrence corresponds to a~suffix of~$\alpha$
whose prefix is~$\beta$. In the lexicographic order of all suffixes, these suffixes form a~range.
We can easily find the start and end of this range using binary search on the suffix array.
We need $\O(\log |\alpha|)$ steps, each step involves string comparison with~$\alpha$, which
takes $\O(|\beta|)$ time in the worst case. This makes $\O(|\beta| \log|\alpha|)$ total.
\corr{Using the suffix array for~$\alpha$, we can enumerate all occurrences of a~substring~$\beta$
in time $\O(|\beta| \log |\alpha| + p)$, where $p$~is the number of occurrences reported. Only
counting the occurrences costs $\O(|\beta| \log |\alpha|)$ time.
}
\note{With further precomputation, time complexity of searching can be improved to $\O(|\beta| + \log |\alpha|)$.
}
\defn{The \df{rank array}~$R[0\ldots n]$ is the inverse permutation of~$S$. That is, $R[i]$ tells how many
suffixes of~$\alpha$ are lexicographically smaller than $\alpha[i:{}]$.
}
\note{The rank array can be trivially computed from the suffix array in time $\O(n)$.}
\defn{The \df{LCP array}~$L[0\ldots n-1]$ stores the length of the \df{longest common prefix}
of each suffix and its lexicographic successor. That is, $L[i] = \LCP(\alpha[S[i]:{}], \alpha[S[i+1]:{}])$,
where $\LCP(\gamma,\delta)$ is the maximum~$k$ such that $\gamma[{}:k] = \delta[{}:k]$.
}
\claim{Given the suffix array, the LCP array can be constructed in time $\O(n)$.}
\obs{The LCP array can be easily used to find the longest common prefix of any two
suffixes $\alpha[i:{}]$ and $\alpha[j:{}]$. We use the rank array to locate them
in the lexicographic order of all suffixes: they lie at positions
$i' = R[i]$ and $j' = R[j]$. Then we compute $k = \min(L[i'], L[i'+1], \ldots, L[j'-1])$.
We claim that $\LCP(\alpha[i:{}], \alpha[j:{}])$ is exactly~$k$.
First, each pair of adjacent suffixes in the range $[i',j']$ has a~common prefix of
length at least~$k$, so our LCP is at least~$k$. However, it cannot be more:
we have $k = L[\ell]$ for some~$\ell \in [i',j'-1]$, so the $\ell$-th and $(\ell+1)$-th suffix
differ at position $k+1$. Since all suffixes in the range share the first~$k$ characters,
their $(k+1)$-th characters must be non-decreasing. This means that the $(k+1)$-th character
of the first and the last suffix in the range must differ, too.
}
This suggests building a~\em{Range Minimum Query} (RMQ) data structure for the array~$L$:
it is a~static data structure, which can answer queries for the position of the minimum
element in a~given range of indices. One example of a~RMQ structure is the 1-dimensional
range tree from section \secref{range1d}: it can be built in time $\O(n)$ and it answers
queries in time $\O(\log n)$. There exists a~better structure with build time $O(n)$
and query time $\O(1)$.
\examples{The arrays we have defined can be used to solve the following problems in linear time:
\list{o}
\:\em{Histogram of $k$-grams:} we want to count occurrences of every substring of length~$k$.
Occurrences of every $k$-gram correspond to ranges of suffixes in their lexicographic order.
These ranges can be easily identified, because we have $L[\ldots] < k$ at their boundaries.
We only have to be careful about suffixes shorter than~$k$, which contain no $k$-gram.
\:\em{The longest repeating substring of a~string~$\alpha$:} Consider two positions~$i$ and~$j$ in~$\alpha$.
The length of the longest common substring starting at these positions is equal to the LCP
of the suffixes $\alpha[i:{}]$ and $\alpha[j:{}]$, which is a~minimum over some range in~$L$.
So it is always equal to some value in~$L$. It is therefore sufficient to consider only pairs
of suffixes adjacent in the lexicographic order, that is to find the maximum value in~$L$.
\:\em{The longest common substring of two strings $\alpha$ and~$\beta$:} We build a~suffix
array and LCP array for the string $\alpha$\|\#|$\beta$, using a~separator~\|\#| which occurs
in neither $\alpha$ nor~$\beta$. We observe that each suffix of $\alpha$\|\#|$\beta$ corresponds
to a~suffix of either~$\alpha$ or~$\beta$. Like in the previous problem, we want to find a~pair
of positions~$i$ and~$j$ such that the LCP of the $i$-th and $j$-th suffix is maximized. We however
need one $i$ and~$j$ to come from~$\alpha$ and the other from~$\beta$. Therefore we find the
maximum $L[k]$ such that $S[k]$ comes from~$\alpha$ and $S[k+1]$ from~$\beta$ or vice versa.
\endlist
}
\subsection{Construction of the LCP array: Kasai's algorithm}
We show an~algorithm which constructs the LCP array~$L$ in linear time,
given the suffix array~$S$ and the rank array~$R$.
We will use~$\alpha_i$ to denote the $i$-th suffix of~$\alpha$ in lexicographic
order, that is $\alpha[S[i]:{}]$.
We can easily compute all $L[i]$ explicitly: for each~$i$, we compare the
suffixes $\alpha_i$ and~$\alpha_{i+1}$ character-by-character from the start
and stop at the first difference. This is obviously correct, but slow. We will
however show that most of these comparisons are redundant.
Consider two suffixes $\alpha_i$ and~$\alpha_{i+1}$ adjacent in lexicographic order.
Suppose that their LCP $k = L[i]$ is non-zero.
Then $\alpha_i[1:{}]$ and $\alpha_{i+1}[1:{}]$ are also suffixes of~$\alpha$,
equal to $\alpha_{i'}$ and~$\alpha_{j'}$ for some $i' < j'$.
Obviously, $\LCP(\alpha_{i'},\alpha_{j'}) = \LCP(\alpha_i, \alpha_{i+1}) - 1 = k - 1$.
However, this LCP is a~minimum of the range $[i',j']$ in the array~$L$,
so we must have $L[i'] \ge k - 1$.
This allows us to process suffixes of~$\alpha$ from the longest to the shortest one,
always obtaining the next suffix by cutting off the first character of the previous suffix.
We calculate the~$L$ of the next suffix by starting with $L$~of the previous suffix minus one
and comparing characters from that position on:
\algo{BuildLCP}
\algin A~string~$\alpha$ of length~$n$, its suffix array~$S$ and rank array~$R$
\:k \= 0 \cmt{The LCP computed in previous step}
\:For $p=0,\ldots,n-1$: \cmt{Start of the current suffix in~$\alpha$}
\::[blcpdec]$k \= \max(k-1, 0)$ \cmt{The next LCP is at least $\hbox{previous}-1$}
\::$i \= R[p]$ \cmt{Index of current suffix in sorted order}
\::$q \= S[i+1]$ \cmt{Start of the lexicographically next suffix in~$\alpha$}
\::While $(p+k < n) \land (q+k < n) \land (\alpha[p+k] = \alpha[q+k])$:
\:::[blcpinc]$k \= k+1$ \cmt{Increase $k$ while characters match}
\::$L[i] = k$ \cmt{Record LCP in the array~$L$}
\algout LCP array~$L$
\endalgo
\lemma{The algorithm \alg{BuildLCP} runs in time $\O(n)$.}
\proof
All operations outside the while loop take $\O(n)$ trivially.
We will amortize time spent in the while loop using~$k$ as a~potential.
The value of~$k$ always lies in $[0,n]$ and it starts at~0.
It always changes by~1: it can be decreased only in step~\itemref{blcpdec}
and increased only in step~\itemref{blcpinc}. Since there are at most~$n$
decreases, there can be at most $2n$ increases before $k$~exceeds~$n$.
So the total time spent in the while loops is also $\O(n)$.
\qed
\subsection{Construction of the suffix array by doubling}
TODO
\endchapter
...@@ -270,6 +270,7 @@ ...@@ -270,6 +270,7 @@
% Various operators and functions % Various operators and functions
\def\intr{{\rm int}} \def\intr{{\rm int}}
\def\key{{\rm key}} \def\key{{\rm key}}
\def\LCP{{\rm LCP}}
% Tabulka operací datové struktury % Tabulka operací datové struktury
\def\optable#1{$$ \def\optable#1{$$
... ...
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please to comment