This method yields a functional data structure since the old version of the tree was technically not modified and can still be used.
With reasonable variants of binary search trees, the achieved time complexity in a tree with $n$ vertices is $\Theta(\log n)$ per operation, and insert/delete consume $\Theta(\log n)$ additional memory.
The downside of this method is the increased space complexity.
There is no apparent construction that would avoid the extra memory consumed by copying the paths.
%TODO: Figure
This outlined method is not exclusive to binary search trees.
It may be used to obtain functional variants of many pointer-based data structures, of which binary search trees are a prime example.
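To make the idea concrete, here is a minimal sketch of path copying for insertion into an unbalanced binary search tree (Python is used for illustration only; the names \texttt{Node} and \texttt{insert} are not taken from any particular library):
\begin{verbatim}
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    """Return the root of a new version; the old version stays valid."""
    if root is None:
        return Node(key)
    if key < root.key:
        # Copy the current vertex, reuse the untouched right subtree.
        return Node(root.key, insert(root.left, key), root.right)
    # Copy the current vertex, reuse the untouched left subtree.
    return Node(root.key, root.left, insert(root.right, key))

# Each version is identified by its root:
# v1 = insert(None, 5); v2 = insert(v1, 3)   # v1 remains usable
\end{verbatim}
Only the vertices on the search path are copied; all other subtrees are shared between the two versions.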
\subsection{Fat Nodes}
To limit memory consumption compared to path copying, we may instead let vertices carry their properties for all versions.
In practice, this means that each vertex stores a collection of changes together with the versions in which those changes happened.
This makes a vertex a kind of dictionary whose keys are versions and whose values are descriptions of changes.
When semi-persistence is sufficient, upon arriving at a vertex while querying version $A$, we go through the collection of changes to figure out what the state in version $A$ should be.
We start with the default values and apply, in chronological order, all changes that happened earlier than $A$, overwriting previous values if there are multiple changes to the same field or pointer.
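As an illustration, the following sketch reconstructs the state of a vertex for version $A$, assuming the changes are stored chronologically as (version, field, value) triples (all names are hypothetical):
\begin{verbatim}
def state_at(vertex, version_a, defaults):
    """Reconstruct the fields of the vertex as of version_a."""
    state = dict(defaults)
    # vertex.changes is assumed to be sorted chronologically.
    for version, field, value in vertex.changes:
        if version >= version_a:
            break              # only changes earlier than A apply
        state[field] = value   # later changes overwrite earlier ones
    return state
\end{verbatim}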
...
...
\section{Pointer Data Structures and Semi-Persistence}
Following up on the idea of fat nodes, let us limit their size so that the applicable version can be identified in constant time.
We will first explain this technique for semi-persistence only.
Full persistence is more complicated and requires a few extra tricks.
A fat node stores a dictionary of standard vertices indexed by versions.
We call the values of this dictionary slots.
The maximum size of this dictionary is set to a constant which is to be determined later.
Temporarily, we will allow the capacity of a fat node to be exceeded.
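In code, a fat node for semi-persistence might look roughly as follows (a sketch only; the capacity \texttt{S} plays the role of the constant to be determined, and all names are illustrative):
\begin{verbatim}
S = 8  # slot capacity; a sufficient value is derived in the proof below

class Slot:
    """One complete state of a vertex, valid from `version` on."""
    def __init__(self, version, state):
        self.version = version        # version that wrote this slot
        self.state = state            # dict of fields and pointers

    def updated(self, version, changes):
        new_state = dict(self.state)  # copy unchanged fields/pointers
        new_state.update(changes)
        return Slot(version, new_state)

class FatNode:
    def __init__(self, first_slot):
        self.slots = [first_slot]     # at most S slots, version-sorted
        self.prev = self.next = None  # list of fat nodes (used below)

    def slot_for(self, version):
        # Constant work, since len(self.slots) <= S.
        best = None
        for s in self.slots:
            if s.version <= version:
                best = s
        return best
\end{verbatim}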
...
...
Not all fields need to be versioned.
For example, balancing information may be stored only for the latest version: in red-black trees, the color is used only for balancing and thus does not need to be persisted for old versions.
(Until full-persistence comes into play.)
One vertex in the original binary search tree then corresponds to a doubly-linked list of fat nodes.
When the vertex changes, the new state of the vertex is written into a slot of the last fat node in the list.
Once all slots of the last fat node are occupied and it needs to be modified again, a new fat node is allocated.
Modifications of a single vertex during one operation are all written to a single slot; there is no need to use more slots.
When a new fat node $x$ is allocated, one of its slots is immediately taken.
Pointers must be updated in other fat nodes that pointed to the fat node preceding $x$ in the list.
This is done either by inserting a new slot into them (copying all values from the latest slot and replacing pointers to the predecessor with pointers to $x$), or by directly updating the pointers if a slot for the current version is already present.
Recursive allocations may be triggered, which is not a problem if there is only a small number of them.
This is ensured by setting the size of fat nodes suitably.
The order in which these allocations are executed can be arbitrary.
Regardless of the order chosen, this process of allocations is finite.
We can place an upper bound on the number of newly allocated fat nodes: the total number of vertices in the tree (including deleted vertices).
At most one new slot is occupied for every vertex in the tree.
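The write path just described can be sketched as follows, reusing the hypothetical \texttt{FatNode} and \texttt{Slot} from the earlier listing; \texttt{current\_backpointers} is an assumed helper that enumerates the fat nodes (and pointer fields) currently pointing to the given fat node:
\begin{verbatim}
def write(node, version, changes):
    """Record changes for `version` in the vertex this list represents."""
    while node.next is not None:      # modifications go to the list tail
        node = node.next
    latest = node.slots[-1]
    if latest.version == version:     # same operation: reuse this slot
        latest.state.update(changes)
        return node
    if len(node.slots) < S:           # a free slot is available
        node.slots.append(latest.updated(version, changes))
        return node
    # Overflow: allocate a new fat node; one slot is taken immediately.
    fresh = FatNode(latest.updated(version, changes))
    node.next, fresh.prev = fresh, node
    # All fat nodes pointing to `node` must now point to `fresh`;
    # this may trigger further (but provably few) allocations.
    for parent, field in current_backpointers(node):  # assumed helper
        write(parent, version, {field: fresh})
    return fresh
\end{verbatim}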
To take advantage of fat nodes, we need the balancing algorithm to limit the number of vertices that change in one operation, at least in the average case.
Several data structures have this property; AVL trees are one example.
Furthermore, we need a limit on the number of pointers that can target one vertex at any one time.
Otherwise, the complexity would suffer.
\theorem{
Consider any binary search tree balancing algorithm satisfying the following properties:
...
...
}
\proof
We denote the number of pointer fields per vertex as $p$ and the maximum number of vertices pointing to one vertex at a time as $k$.
We then define the number of slots in a fat node as $s = p + k + 1$.
We define the potential of the structure as the total number of occupied slots in all fat nodes that are the last in their doubly-linked list.
(Thus initially zero.)
Allocation of a new fat node will cost one unit of energy.
This cost can be paid from the potential or charged to the operation.
We will show that the operation needs to be charged only a constant amount of energy per vertex modification made by the original algorithm (to compensate for increases in potential or to pay for allocations), from which the proposition follows.
For insert, a new fat node is created with one occupied slot (which increases the potential by a constant).
This increase is paid for by the operation.
During rebalancing of the tree, $r$ vertices are to be modified.
Let us consider this sequence of modifications one by one.
The operation sends one floating unit of energy to each of the $r$ vertices.
If a modification of $v$ is the second or a later modification of $v$ during this operation, the changes are simply written to the slot for this version.
Otherwise, the number of used slots is checked.
If an empty slot is available, a new slot is used.
The new slot takes the values of unchanged fields and pointers from the preceding slot.
This increases the potential by one, which is covered by the floating unit of energy.
If no slot is available, a new fat node $v'$ is allocated and one of its slots is used.
This step triggers a decrease in the potential by $p+k$.
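Explicitly, the $s$ occupied slots of the full fat node cease to count towards the potential (it is no longer the last node in its list), while the single occupied slot of $v'$ starts to count, so
\[
  \Delta\Phi = 1 - s = 1 - (p + k + 1) = -(p + k).
\]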
The floating unit of energy is used to pay for the allocation of the new fat node.
Next, the fat nodes corresponding to the current versions of the vertices holding pointers to $v$ need to have this change reflected.
Additionally, inverse pointers to $v'$ need to be set.
These are at most $p+k$ changes to other vertices, which may be written into their empty slots.
The decrease in potential is used to send one unit of energy to every such vertex that needs an update.
The changes are executed recursively and do not require extra energy to be charged to this modification.
The operation is charged only a constant amount of work for every change it makes.
The space consumed is bounded by the number of changes done by update operations.
Thus the space complexity is $\O(n)$.
\qed
Regarding the time complexity, searching for the correct slot in a fat node produces only constant overhead.
It is easy to see that every operation done during invariant restoration can be charged to a memory allocation of a fat node.
Moreover, there exists a positive constant $c$, depending only on the balancing algorithm, such that for every allocated fat node the number of operations charged to it is at most $c$.
Assuming the conditions from the previous proposition, the cost of writing changes into fat nodes is amortized $\O(1)$ per operation.
...
...
To build the data structure, we will follow the general idea of line sweeping.
We start by sorting the vertices of all faces by the first coordinate $x$.
We will continually process these vertices in the order of increasing coordinate $x$.
We maintain $S$, a sorted list of edges (in a semi-persistent binary search tree) during this processing.
The list $S$ contains the edges that intersect the sweeping line parallel to the secondary axis $y$, in the order of the intersections (sorted by the second coordinate $y$).
Initially the list is empty and we set the sweeping line to intersect the first vertex.
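A sketch of this construction under the stated assumptions (\texttt{PersistentBST} stands for any semi-persistent binary search tree ordered by the intersection with the sweeping line; all names are illustrative):
\begin{verbatim}
def build(vertices, edges_ending_at, edges_starting_at):
    """Produce one version of S per processed x-coordinate."""
    S = PersistentBST()          # hypothetical semi-persistent BST
    versions = []                # (x, version) pairs kept for queries
    for v in sorted(vertices, key=lambda v: v.x):
        for e in edges_ending_at[v]:    # the sweeping line leaves e
            S.delete(e)
        for e in edges_starting_at[v]:  # the sweeping line enters e
            S.insert(e)
        versions.append((v.x, S.latest_version()))
    return S, versions
\end{verbatim}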
...
...
The number of vertices will be denoted $n$; the number of edges is thus bounded by $3n - 6 = \O(n)$.
This follows from the partitioning being a drawing of a planar graph.
Complexity is therefore $\O(n \log n)$ for pre-processing and $\O(\log n)$ for one query.
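One plausible way to answer a query, consistent with the stated bound: a binary search on the $x$-coordinate chooses the version, and a search in that version of the tree finds the edge directly below the query point (a sketch reusing the hypothetical names from the previous listing):
\begin{verbatim}
import bisect

def locate(S, versions, x, y):
    """Return the edge directly below the query point (x, y)."""
    xs = [vx for vx, _ in versions]
    i = bisect.bisect_right(xs, x) - 1   # last version with vx <= x
    if i < 0:
        return None                      # point precedes all vertices
    _, version = versions[i]
    # The face containing (x, y) is determined by this edge.
    return S.predecessor(y, version)     # assumed BST operation
\end{verbatim}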
%TODO: Figure
\section{A Sketch of Full-Persistence}
It is possible to obtain fully-persistent binary search trees with $\O(\log n)$ time complexity per operation, where $n$ is the total number of updates.
A couple of obstacles must be overcome, however.
...
...
Similarly, if an expansion of the array is needed, its cost can be amortized by paying a constant for every insert.
\qed
\exercises
\ex{Consider what properties of (semi-)persistent binary search trees change when the capacity of fat nodes is increased above the value used in the proof.}