This chapter is devoted to efficient preservation of the history of data structures, the so-called \em{persistence}.
\section{The Notion of Persistence}
The original state of a typical data structure is lost when an operation modifies it.
Usually, no harm comes from this.
In certain situations, however, we need to delve into the past.
Let us therefore form a hierarchy of data structures with respect to preservation of history.
\list{o}
\:\em{Ephemeral data structures} are the most widespread.
Past states are destroyed by "destructive" updates and irretrievably lost.
\:\em{Semi-persistent data structures} (sometimes called \em{partially persistent})
still support modifying operations only on the most recent version.
Updates produce a new version of the structure.
Historical versions form a linear order.
Past states can be reconstructed efficiently for reading.
\:\em{(Fully-)persistent data structures} directly extend their semi-persistent counterparts.
An update can be done on any version of the structure, resulting in a rooted tree of versions.
An update corresponds to adding a new child vertex to an existing vertex in the tree.
\:\em{(Purely) functional data structures} permit no changes to data once written.
Naturally, persistence is guaranteed under such an assumption,
as all versions appear as if just created by the last update.
This enforced immutability completely freezes past versions in time.
This concept is intrinsic to functional languages such as Haskell,
but proves to be useful even in imperative programming.
Multi-threading is easier when there is confidence that existing objects will not be modified.
\endlist
\section{Basic Constructs}
We will explore several simple concepts, which we will later combine to reach an optimal persistent pointer-based structure. Before that, however, let us note that some data structures are persistent already in their default implementation. Take the stack, for example.
\subsection{Persistent Stack}
We may implement a stack as a forward-list, that is, a singly linked list where every item contains a pointer to the next one and the structure is represented by a pointer to the head of the list. We can then work with earlier versions simply by keeping pointers to the previous heads. Pushing a new element means creating a new head that points to the head of the version chosen to precede the newly created version. Popping means taking as the new head the element directly following the head of the chosen version.
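A minimal sketch of such a stack in C follows; the names are ours and error handling is omitted. Older versions remain valid after every operation, since no node is ever modified or freed.
\begtt
#include <stdlib.h>

/* A node of the forward-list; nodes are never modified once created. */
struct node {
    int value;
    struct node *next;
};

/* A version of the stack is just a pointer to its head (NULL = empty). */
typedef struct node *stack;

/* Push returns a new version; the chosen older version stays unchanged. */
stack push(stack version, int value) {
    struct node *head = malloc(sizeof *head);
    head->value = value;
    head->next = version;   /* share the whole tail with the old version */
    return head;
}

/* Pop returns the version with the top element removed. */
stack pop(stack version) {
    return version->next;   /* nothing is freed, old versions stay intact */
}
\endtt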
\subsection{Persistence through Path-Copying}
Let us now turn to binary search trees. A binary search tree can be converted into a fully-persistent one rather easily if space complexity is not a concern.
The straightforward approach is called \em{path-copying}. It is based on the observation that most of the tree does not change during an update.
When a new version of the tree is to be created by a delete or insert, new copies are allocated only for the vertices changed by the operation and for their ancestors.
This typically means that only the path from the inserted or deleted vertex to the root is newly allocated, plus a constant number of other vertices.
The new vertices carry pointers to the old vertices wherever the subtree rooted at such a vertex is not modified in any way.
Here we tacitly assume that only pointers to children are stored; updating the root in a tree with pointers to parents would involve creating new instances of all vertices in the tree.
With reasonable variants of binary search trees, the achieved time complexity in a tree with $n$ vertices is $\Theta(\log n)$ per operation, with $\Theta(\log n)$ memory per insert or delete.
The downside of this method is the increased space complexity: there is no apparent modification that would avoid the extra memory spent on copying the paths.
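For illustration, here is a sketch of a path-copying insert in C for an unbalanced binary search tree; the names are ours, and rebalancing, which would copy a constant number of extra vertices, is omitted.
\begtt
#include <stdlib.h>

struct tree {
    int key;
    struct tree *left, *right;
};

/* Insert by path-copying: only vertices on the search path are copied;
   every untouched subtree is shared with the previous version. */
struct tree *insert(struct tree *root, int key) {
    struct tree *copy = malloc(sizeof *copy);
    if (root == NULL) {
        copy->key = key;
        copy->left = copy->right = NULL;
    } else {
        *copy = *root;               /* copy key and both child pointers */
        if (key < root->key)
            copy->left = insert(copy->left, key);
        else
            copy->right = insert(copy->right, key);
    }
    return copy;                     /* root of the new version */
}
\endtt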
\subsection{Fat Nodes}
To limit memory consumption, we may instead let vertices carry their properties for all versions. In practice, this means that we store a collection of changes together with the versions in which those changes happened. A vertex thus becomes a kind of dictionary whose keys are versions and whose values are descriptions of changes.
When semi-persistence is sufficient, upon arriving at a vertex and asking for its state in version $A$, we go through the collection of changes to figure out what the state in version $A$ should be:
we start with the default values and apply, in chronological order, all changes that happened no later than $A$, overwriting when there are multiple changes to the same field or pointer.
This process yields the correct state of the vertex for version $A$.
In fact, it might be easier to copy all the other fields of the vertex as well whenever one of them changes.
One change record will therefore hold new values for all fields and pointers of the vertex.
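A concrete sketch in C of this simplified scheme follows; the names are of our choosing. For semi-persistence, versions can simply be numbered, so the records stay sorted by version and a linear scan stands in for the collection whose choice is discussed below.
\begtt
#include <stddef.h>

struct fat_node;

/* One change record: a version stamp plus a full snapshot of all fields
   and pointers of the vertex. */
struct change {
    unsigned version;           /* semi-persistence: versions 0, 1, 2, ... */
    int key;
    struct fat_node *left, *right;
};

struct fat_node {
    struct change *changes;     /* sorted by version, as updates only append */
    size_t nchanges;
};

/* State of the vertex in version a: the latest change with version <= a. */
struct change *state_at(struct fat_node *v, unsigned a) {
    struct change *best = NULL;
    for (size_t i = 0; i < v->nchanges && v->changes[i].version <= a; i++)
        best = &v->changes[i];
    return best;                /* NULL: the vertex did not exist yet */
}
\endtt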
For full-persistence, we also need to resolve the issue of how to efficiently determine which changes should be applied.
It will be addressed later through the introduction of a total ordering of all versions.
Looking at the memory consumption, it is clear that the amount consumed is proportional to the total number of changes made by the balancing algorithm across all operations.
What remains to be solved is choosing the right kind of collection for the versioned changes of one vertex. Surely, using linked lists or arrays will inevitably lead to unsatisfactorily inefficient lookup of the applicable changes. Unfortunately, as it turns out, even other data structures will let us down here.
With this approach, we can reach space complexity linear in the total number of changes in the tree; execution time, on the other hand, will suffer, with the amortized cost per operation depending on the chosen collection.
\section{Pointer Data Structures and Semi-Persistence}
Following up on the idea of fat vertices, let us limit their size so that the applicable version can be identified in constant time.
We will explain this technique only for semi-persistence first.
Full persistence requires the use of a few extra tricks and establishing an ordering on the versions.
A fat vertex stores a dictionary of standard vertices indexed by versions.
We call the values of this dictionary \em{slots}.
The maximum size of this dictionary is set to a constant which will be determined later.
Temporarily, we will allow the capacity of a fat vertex to be exceeded.
This will, however, have to be fixed before the ongoing operation finishes.
By placing a restriction on the size, we circumvent the increased complexity of search within one vertex.
Instead of copying the vertex, we simply add a new slot into the dictionary.
Provided the maximum has not been exceeded yet, this insertion of a slot stops the propagation of changes toward the root.
The reader should recall that this was the major weakness of path-copying.
Because of the limit on the size of this dictionary, it may be implemented simply as a linked list.
The contents of one slot are: a version handle; all pointers a vertex would have; inverse pointers to the fat vertices that have slots pointing to this fat vertex in this version; and some fields, notably a key and a value as a bare minimum.
Not all fields need to be versioned.
For example, balancing information may be stored only for the latest version: in red-black trees, the color is used only for balancing and thus need not be persisted for old versions.
(Until full-persistence comes into play.)
One vertex in the original binary search tree then corresponds to a doubly-linked list of fat vertices.
When the vertex changes, its new state is written into a slot of the last fat vertex in the list.
Once all slots in the last fat vertex are occupied and it needs to be modified, a new fat vertex is allocated.
Modifications of a single vertex during one operation are all written into a single slot; there is no need to use more slots.
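In C, a fat vertex might look as follows. This is a sketch only: the constants {\tt S} and {\tt X} are illustrative (the slot count will later be fixed as $s = p + x + 1$), and the balancing information is shown as a rank field.
\begtt
enum {
    S = 6,       /* slots per fat vertex: s = p + x + 1 with p = 2, x = 3 */
    X = 3        /* assumed bound on pointers to one vertex at a time */
};

struct fat_vertex;

struct slot {
    unsigned version;                 /* version handle */
    struct fat_vertex *left, *right;  /* pointers a plain vertex would have */
    struct fat_vertex *inverse[X];    /* fat vertices pointing here in this
                                         version */
    int key, value;                   /* versioned fields */
};

struct fat_vertex {
    struct slot slots[S];
    int used;                  /* occupied slots; at most S between operations */
    int rank;                  /* unversioned balancing info, latest version only */
    struct fat_vertex *prev, *next;   /* list of fat vertices of one vertex */
};
\endtt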
When a new fat vertex $x$ is allocated, one of its slots is immediately taken.
Pointers must then be updated in the other fat vertices that pointed to the fat vertex preceding $x$ in the list.
This is done either by inserting a new slot into them (copying all values from the latest slot and replacing the pointers to the predecessor by pointers to $x$),
or by directly updating the pointers if the right version is already present.
Recursive allocations may be triggered, which is not a problem as long as there are only few of them.
This is ensured by setting the size of fat vertices suitably.
The order in which these allocations are executed can be arbitrary.
We can place an upper bound on the number of fat vertices newly allocated during one operation: the total number of vertices in the tree (including deleted vertices), since at most one new slot is occupied for every vertex of the tree.
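The following C sketch condenses the three cases of recording one change, building on the structures above. Inverse-pointer bookkeeping and the recursive redirection of pointers are left to the caller, as the comments indicate.
\begtt
#include <stdlib.h>

/* Record a change to a vertex whose last fat vertex is `last`; `cur` is the
   newest version.  Returns the fat vertex now holding the newest slot; if a
   fresh one was allocated, the caller must redirect all current-version
   pointers to it, recursively recording changes in the pointing vertices. */
struct fat_vertex *record(struct fat_vertex *last, struct slot change,
                          unsigned cur) {
    struct slot *top = &last->slots[last->used - 1];
    if (top->version == cur) {        /* vertex already touched: reuse slot */
        *top = change;
        return last;
    }
    if (last->used < S) {             /* free slot: propagation stops here */
        last->slots[last->used++] = change;
        return last;
    }
    struct fat_vertex *fresh = calloc(1, sizeof *fresh);
    fresh->slots[0] = change;         /* one slot is taken immediately */
    fresh->used = 1;
    fresh->prev = last;
    last->next = fresh;
    return fresh;                     /* caller must now update pointers */
}
\endtt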
To take advantage of fat vertices, we need the balancing algorithm to limit the number of vertices that change in one operation.
This was the goal of the modified WAVL-balancing algorithm all along.
Furthermore, we need a limit on the number of pointers that can target one vertex at one time.
\theorem{
Consider any binary search tree balancing algorithm satisfying the following properties:
\list{o}
\:There is a constant $k$ such that for any $n$ successive operations on an initially empty tree, the number of vertex changes made to the tree is at most $kn$.
\:There is a constant bound on the number of pointers to any one vertex at any time.
\endlist
Then this algorithm, with the addition of fat vertices for semi-persistence, consumes $\O(n)$ space for the entire history of $n$ operations starting from an empty tree.
}
\proof
We denote the number of pointer fields per vertex by $p$ and the maximum number of pointers to one vertex at a time by $x$. We then set the number of slots in a fat vertex to $s = p + x + 1$. (A higher value is also possible.)\\
We define the potential of the structure as the total number of occupied slots in all fat vertices that are last in their doubly-linked lists. (Thus it is initially zero.) Allocation of a new fat vertex will cost one unit of energy; this cost can be charged either to the potential or to the operation. We will show that the operation needs to be charged only a constant amount of energy per vertex modification made by the original algorithm (to compensate for increases in potential or to pay for allocations), from which the proposition follows.\\
For insert, a new vertex is created (which increases the potential by a constant). During rebalancing of the tree, $r$ vertices are to be modified. Let us consider this sequence of modifications one by one; we send one floating unit of energy to each of the $r$ vertices.\\
If a modification of $v$ is the second or a later modification of $v$ during this operation, the changes are simply written into the slot for this version.\\
Otherwise, the number of used slots is checked. If at least one slot is empty, a new slot is used, taking the default values of its fields from the preceding slot. This increases the potential by one, which is covered by the floating unit of energy.\\
If no slots are available, a new fat vertex $v'$ is allocated and one of its slots is used. This step decreases the potential by $p + x$: the full fat vertex with its $s = p + x + 1$ occupied slots is no longer last in its list, while the new one contributes a single occupied slot. The floating unit of energy pays for the allocation of the new vertex. Next, the fat vertices that have pointers to $v$ in the current version interval need to have this change reflected, and inverse pointers to $v'$ need to be set. These are at most $p + x$ changes to other vertices, which may use their new empty slots. The decrease in potential is used to send one unit of energy to every such vertex that needs an update. The changes are executed recursively and will not require extra energy to be charged to this modification.
\qed
Regarding the time complexity, searching for the correct slot in a fat vertex produces only a constant overhead. Moreover, every slot write can be charged to the memory allocation of some fat vertex in such a way that there is a constant $c$, depending only on the balancing algorithm, for which the number of writes charged to any allocated fat vertex is at most $c$.
Assuming the conditions of the previous theorem, the cost of writing changes into fat vertices is therefore amortized to $\O(1)$ per operation.
\section{Point Location in a Plane}
Given a bounded connected subset of a plane partitioned into a finite set of faces, the goal is to answer queries asking to which face a point $P$ belongs. We limit ourselves to polygonal faces.
One special case of particular importance is finding the closest point of a fixed set, i.e. the case where the faces are the cells of a Voronoi diagram.
To build the data structure, we follow the general idea of line-sweeping. We start by sorting the vertices of all faces by the first coordinate and then process them in order of increasing first coordinate. During this processing, we maintain a sorted list of edges in a semi-persistent BST. The list contains the edges intersecting a sweeping line parallel to the secondary axis, in the order of the intersections (sorted by the second coordinate).
Initially the list is empty and we set the sweeping line to intersect the first vertex.
When we start processing a new vertex, we imagine moving the sweeping line along the primary axis until it intersects the new vertex. We can easily observe that the order of the edges cannot change during this virtual movement. (No two edges can intersect except in a vertex.) It will happen, however, that edges must be either removed from the list or added to it: the edges ending at the processed vertex are deleted and the edges starting at it are inserted, each update creating a new version of the semi-persistent tree.
We store pointers to the versions created at each vertex in an array sorted by the first coordinate. A query for a point $P$ then finds the appropriate version by binary search and searches that version of the tree by the second coordinate.
The number of vertices will be denoted by $n$; the number of edges is then bounded by $3n$, as the partition is a drawing of a planar graph. The complexity is therefore $\O(n \log n)$ for pre-processing and $\O(\log n)$ per query.
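A sketch in C of the version lookup follows; the names and types are ours, and the edge tree is left opaque. Searching the returned version of the tree by the second coordinate completes the query.
\begtt
#include <stddef.h>

struct edge_tree;           /* one version of the semi-persistent edge tree */

struct version {
    double x;               /* first coordinate of the vertex that created it */
    struct edge_tree *root;
};

/* Versions sorted by x; find the last version whose slab contains px. */
struct version *version_for(struct version *vs, size_t n, double px) {
    size_t lo = 0, hi = n;
    while (lo < hi) {       /* binary search: count versions with x <= px */
        size_t mid = (lo + hi) / 2;
        if (vs[mid].x <= px)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo ? &vs[lo - 1] : NULL;   /* NULL: px precedes the first vertex */
}
\endtt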
\section{A Sketch of Full-Persistence}
\subsection{List Ordering}
Moving from semi-persistence to full-persistence, we encounter an obstacle -- versions no longer form an implicit linear order. (By versions we mean the states of the tree between updates. We will also use some auxiliary versions not directly mappable to any such state.) Nonetheless, to work with fat vertices, we need to be able to determine the slots that carry the values correct for the current version. To achieve this, we need to identify the interval of versions into which the current version falls. For this purpose, we will introduce an ordering on the versions of the persistent data structure.
Versions do form a rooted tree with the original version at the root. We can also order the children of every vertex by the order of their creation (latest creation first). We then define the desired ordering as the order in which the vertices corresponding to versions are first visited by a depth-first traversal of this tree, i.e. in pre-order.
In reality, we will also insert further elements into the ordering; these will be helper versions, not used to represent the state of the entire structure after any sequence of operations. This can be disregarded for now.
With the ordering defined, we still need a way to represent it efficiently in memory. We really need only two operations:
\list{o}
\:{\tt InsertSuccessor(Version)} -- this operation will insert a new version between {\tt Version} and its successor (if any). The newly created version is returned.
\:{\tt Compare(VersionA, VersionB)} -- returns $1$, $-1$, or $0$, indicating whether {\tt VersionA} precedes, succeeds, or equals {\tt VersionB}.
\endlist
We will strive to assign an integer to each version such that these integers can be compared in constant time. This assignment problem is called \em{list-labeling}, and there are several ways of tackling it.
The straightforward idea would be to assign 0 to the first version and $2^m$ to an artificial upper bound. Every new version is then assigned the arithmetic average of its predecessor and successor. Now, we see that if $v = (p + s)/2$ and $2^k \mid p, s$, then $2^{k-1} \mid v$. The first two integers are divisible by $2^m$, guaranteeing capacity $\Omega(m)$.
Let us denote by $n$ the total number of updates to the persistent tree. It is reasonable to assume that arithmetic operations on non-negative integers not exceeding $n$ can be done in constant time. The straightforward idea therefore permits us to create only $\O(\log n)$ versions if constant time per operation is required. This is not sufficient.
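A short C demonstration of this limit follows (the constant $m = 20$ is illustrative): repeatedly inserting a version just after the same predecessor halves the available gap each time, so the labels are exhausted after $m$ insertions.
\begtt
#include <stdio.h>

/* Naive list-labeling: a new version gets the average of its neighbours. */
unsigned long insert_between(unsigned long pred, unsigned long succ) {
    return (pred + succ) / 2;
}

int main(void) {
    unsigned long lo = 0, hi = 1UL << 20;   /* first version 0, bound 2^m */
    int steps = 0;
    while (insert_between(lo, hi) != lo) {  /* gap halves with every insertion */
        lo = insert_between(lo, hi);
        steps++;
    }
    printf("room for %d successive insertions\n", steps);  /* prints 20 */
    return 0;
}
\endtt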
To preserve the speed of semi-persistence, {\tt Compare} must take $\O(1)$ time and {\tt InsertSuccessor} $\O(\log n)$, at least amortized. The weight-balanced trees introduced in the previous chapter are ideally suited for this purpose.
For every vertex of the weight-balanced tree, we store an encoding of the path from the root to that vertex in the form of a sequence of 0s and 1s: 1 for a right child, 0 for a left child. We know that the depth of the tree is logarithmic, so the compared integers consist of $\O(\log n)$ bits and comparison of versions is efficient. Inserting a successor version is also simple. Rebuilding of a subtree does not change the order of versions, as all path encodings within it are recalculated.
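One detail needs care: paths of different lengths must be made comparable. One concrete way (our choice, sketched below in C with an assumed depth bound {\tt W}) is to append a sentinel 1 bit after the path and pad with zeros to a fixed width; plain integer comparison of the resulting codes then matches the in-order position of the vertices.
\begtt
#include <stdint.h>

enum { W = 62 };   /* assumed upper bound on tree depth, O(log n) here */

/* Encode a root-to-vertex path (0 = left, 1 = right) as an integer:
   path bits, a sentinel 1, then zero padding to a common width. */
uint64_t encode_path(const int *bits, int depth) {
    uint64_t code = 0;
    for (int i = 0; i < depth; i++)
        code = (code << 1) | (uint64_t)bits[i];
    code = (code << 1) | 1;       /* sentinel marks where the path ends */
    return code << (W - depth);   /* pad: all codes get width W + 1 */
}

/* Compare(VersionA, VersionB) is then plain comparison of the codes. */
int compare(uint64_t a, uint64_t b) {
    return (a > b) - (a < b);     /* 1, -1, or 0 */
}
\endtt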