\subsection{Persistent Stack}
We may implement stack as a forward-list.
In that case every item contains a pointer to the next and the structure is represented by a pointer to the head of the forward-list.
This implementation is naturally persistent. We can work with earlier versions by pointing to previous heads of the forward-list.
Pushing new element means creating a new head pointing to head of the version chosen to precede newly this created version.
Deleting means setting the element which directly follows head of the chosen version as the new head.
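For illustration, the following is a minimal sketch in Python (the names are ours and purely illustrative): a version of the stack is simply a, possibly empty, pointer to a head, and push/pop create new heads without touching older ones.

\begin{verbatim}
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    value: int
    next: Optional["Node"]          # pointer to the rest of the forward-list

def push(head: Optional[Node], value: int) -> Node:
    return Node(value, head)        # new head pointing to the head of the chosen version

def pop(head: Node) -> Optional[Node]:
    return head.next                # the element after the old head becomes the new head

v0 = None                           # the empty stack
v1 = push(v0, 1)                    # version 1: 1
v2 = push(v1, 2)                    # version 2: 2, 1
v3 = pop(v2)                        # version 3: 1 -- versions v1 and v2 remain usable
assert v3 is v1                     # the new version shares its whole list with version 1
\end{verbatim}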
\subsection{Persistence through Path-Copying}
Let us now turn to binary search trees. A binary search tree can be converted into a fully-persistent one rather easily if space complexity is not a concern.
One straight-forward approach to achieve this is called \em{path-copying}. It is based on the observation that most of the tree does not change during an update.
When a new version of the tree should be created by a delete or insert, copies are created for the vertices that are to be changed by the operation and for their ancestors.
This typically means that only the path from the inserted/deleted vertex to the root is duplicated, plus a constant number of other vertices close to the path (due to rebalancing).
The changes are only applied to the copies.
Pointers in the new vertices are updated to point to the new copies of the corresponding vertices wherever such copies were created.
The root of the newly created version is the copy of the original root. Thus the new version now shares some vertices with the old version.
Here we tacitly assume that only pointers to children are stored. Updating the root in a tree with pointers to parents would involve creating copies for all nodes in the tree.
This method yields a functional data structure since the old version of the tree was technically not modified and can still be used.
With reasonable variants of binary search trees, the achieved time complexity in a tree with $n$ vertices is $\Theta(\log n)$ per operation, with $\Theta(\log n)$ memory per insert/delete.
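The following sketch in Python illustrates path-copying on insert (the names are ours; balancing is omitted, so the constant number of extra copies caused by rebalancing is not shown): only the vertices on the search path are copied and untouched subtrees are shared between versions.

\begin{verbatim}
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Vertex:
    key: int
    left: Optional["Vertex"] = None
    right: Optional["Vertex"] = None

def insert(root: Optional[Vertex], key: int) -> Vertex:
    """Return the root of the new version; the version rooted at `root` stays untouched."""
    if root is None:
        return Vertex(key)
    if key < root.key:
        return Vertex(root.key, insert(root.left, key), root.right)   # share the right subtree
    if key > root.key:
        return Vertex(root.key, root.left, insert(root.right, key))   # share the left subtree
    return root                      # key already present: the new version equals the old one

v1 = insert(None, 5)
v2 = insert(v1, 3)
v3 = insert(v2, 8)                   # v1 and v2 are still valid search trees
assert v3.left is v2.left            # the untouched left subtree is shared
\end{verbatim}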
...
...
%TODO: Figure
This outlined method is not exclusive to binary search trees.
It may be used to obtain functional variants of many pointer data structures, of which binary search trees are a prime example.
\subsection{Fat Nodes}
To limit memory consumption over path-copying we may rather choose to let vertices carry their properties for all versions.
This in reality means that we must store a collection of changes together with the versions in which those changes happened.
This makes a vertex akin to a dictionary with keys being versions and values being the descriptions of changes.
When semi-persistence is sufficient, upon arriving at a vertex while asking for the state in version $A$, we go through the collection of changes to figure out what the state in version $A$ should be.
We start with the default values and apply, in chronological order, all changes that happened no later than in version $A$, overwriting whenever there are multiple changes to the same field or pointer.
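For a single field this amounts to taking the value written by the last change that happened no later than the queried version. A minimal sketch in Python follows (the interface is ours and purely illustrative); in semi-persistence all writes go to the newest version, so each history stays sorted and a binary search suffices.

\begin{verbatim}
from bisect import bisect_right

class FatVertex:
    """A vertex storing, for every field, its whole history of (version, value) changes."""

    def __init__(self, **defaults):
        # per field: parallel lists of versions and values, kept sorted by version
        self.versions = {f: [0] for f in defaults}
        self.values = {f: [v] for f, v in defaults.items()}

    def write(self, version, field, value):
        self.versions[field].append(version)   # writes arrive with increasing versions
        self.values[field].append(value)

    def read(self, version, field):
        # the state in the queried version is given by the last change
        # made no later than that version
        i = bisect_right(self.versions[field], version) - 1
        return self.values[field][i]

v = FatVertex(key=5, left=None, right=None)
v.write(3, "left", "child-created-in-version-3")
assert v.read(2, "left") is None
assert v.read(7, "left") == "child-created-in-version-3"
\end{verbatim}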
...
...
Given a bounded connected subset of the plane partitioned into a finite set of faces, the goal is to respond to queries asking to which face a point $p$ belongs.
We limit ourselves to polygonal faces.
One special case of particular importance of this problem is finding the closest point, i.e. when the faces are Voronoi cells.
To build the data structure we will follow the general idea of line sweeping.
We start by sorting vertices of all faces by the first coordinate $x$.
We will continually process these vertices in the order of increasing coordinate $x$.
During this processing we maintain $S$, a sorted list of edges (stored in a semi-persistent binary search tree).
The list $S$ contains edges that intersect with a sweeping line parallel to the secondary axis~$y$ in order of the intersections (sorted by the second coordinate $y$).
Initially the list is empty and we set the sweeping line to intersect the first vertex.
When we start processing a new vertex, we imagine moving the sweeping line along the primary axis to intersect with the new vertex.
We can easily observe that the order of edges cannot change during this virtual movement.
(None of the edges can intersect except at a vertex.)
It will happen, however, that edges must be either removed from the list or added to the list.
We cannot store keys inside $S$ because the coordinates of intersections with the sweeping line change as it moves.
...
...
During a query, the first step is to identify which version of $S$ to use.
This can be done via a binary search in versions of $S$.
Then the face is identified by finding which edges in that version of $S$ are closest to the searched point.
The number of vertices will be denoted by $n$; the number of edges is thus bounded by $3n$.
This follows from the partitioning being a drawing of a planar graph.
Complexity is therefore $\O(n \log n)$ for pre-processing and $\O(\log n)$ for one query.
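The following simplified sketch in Python shows the two binary searches performed by a query (all names are ours; for clarity every version of $S$ is materialized as a plain list of edges ordered bottom-to-top instead of a root of the semi-persistent tree, and edges are assumed to be non-vertical). Note that the keys are computed on the fly from the $x$-coordinate of the query point, since they cannot be stored in $S$.

\begin{verbatim}
from bisect import bisect_right

def y_on_edge(edge, x):
    # y-coordinate of the intersection of `edge` with the vertical line at x
    (x1, y1), (x2, y2) = edge
    return y1 + (y2 - y1) * (x - x1) / (x2 - x1)

def locate(xs, edge_lists, p):
    # xs: sorted x-coordinates at which the sweep line stopped
    # edge_lists[i]: edges of S valid just after xs[i], ordered bottom-to-top
    px, py = p
    i = bisect_right(xs, px) - 1          # 1. binary search for the right version of S
    if i < 0:
        return None, None                 # the point lies to the left of the subdivision
    edges = edge_lists[i]
    lo, hi = -1, len(edges)               # 2. binary search for the edge directly below p
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if y_on_edge(edges[mid], px) <= py:
            lo = mid
        else:
            hi = mid
    below = edges[lo] if lo >= 0 else None
    above = edges[hi] if hi < len(edges) else None
    return below, above                   # this pair of edges identifies the face
\end{verbatim}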
\: First of all, we need to decide on how to represent relationships between versions.
This is best done via a rooted tree with ordered children.
The root of the tree represents the initial empty tree.
When an update on version $v$ creates a new version, it is inserted into this tree as a child of $v$.
We would like to introduce a dynamic linear order on all versions which respects the structure of this version tree.
Of course, comparison of versions must be efficient, ideally taking $\O(1)$, and an insert must take at most $\O(\log m)$.
This problem is called \em{list ordering} and we address it in the next section.
\: Another obstacle arises from the worst-case complexity of updates.
Imagine one update on the structure in version $v$ taking $\omega(\log n)$ time or making changes to $\omega(1)$ nodes.
Since this update can hypothetically be repeated an unlimited number of times, amortized bounds do not help here and we need every update to make only $\O(1)$ changes to the tree in the worst case.
Few binary search trees possess this property and those that do are typically very complicated.
Some common types of binary search trees like red-black trees or weak-AVL trees can be represented in such a way that updates make changes to $\O(1)$ vertices.
\: In semi-persistence it was not necessary to store all information about nodes.
It was sufficient to store pointers to children, the key, and the value for older versions.
Fields useful for balancing were not needed for older versions.
This is not true for full persistence: all information must be stored.
\section{List Ordering}
Moving from semi-persistence to full persistence we encounter an obstacle -- versions no longer form an implicit linear order.
(By versions we mean states of the tree in~between updates.
We will also use some auxiliary versions not directly mappable to any such state.)
Nonetheless, to work with fat vertices, we need to be able to determine the slots that carry values correct for the current version.
To achieve this, we need to identify an interval of versions the current version would fall into.
For this purpose we will try to introduce an ordering to versions of the persistent data structure.
Versions do form a rooted tree with the original version in the root.
We can also order children of every version by the order of their creation (latest creation first).
We then define the desired ordering as the order in which the versions are discovered by a depth-first search respecting the order of children.
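A tiny sketch in Python (with an ad-hoc, illustrative representation of the version tree) shows this order; note that a newly created version immediately follows its parent.

\begin{verbatim}
def version_order(children, root):
    # children[v] lists the children of version v, newest-first
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        # pushing in reverse makes the newest child the next vertex to be popped,
        # i.e. it is discovered immediately after its parent
        stack.extend(reversed(children.get(v, [])))
    return order

# r -> a (created first, with its own child a1), then b created as a newer child of r
children = {"r": ["b", "a"], "a": ["a1"]}
assert version_order(children, "r") == ["r", "b", "a", "a1"]
\end{verbatim}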
In reality, we will also insert other elements into the ordering; these will be helper versions and will not be used to represent the state of the entire structure after a sequence of operations.
This can be disregarded for now.
With the ordering defined, we still need a way to efficiently represent it in memory.
We really only need two operations:
\list{o}
\:{\tt InsertSuccessor(Version)} -- this operation will insert a new version between {\tt Version} and its successor (if any). The newly created version is returned.
\:{\tt Compare(VersionA, VersionB)} -- returns $1$, $-1$, or $0$ indicating whether {\tt VersionA} precedes, succeeds, or is equal to {\tt VersionB}.
\endlist
We will strive to find a way to assign an integer to each version; these integers will be comparable in constant time.
This assignment problem is called \em{list-labeling} and there are several ways to tackle it.
Suppose we want to be able to assign up to $m$ labels.
The straight-forward idea would be to assign 0 to the first item and $2^m$ to an artificial upper bound.
Each newly inserted item will be assigned the arithmetic mean of the labels of its predecessor and successor.
Now, we see that if $v = (p + s)/2$ and $2^k \mid p, s$, then $2^{k-1} \mid v$.
The first two integers are divisible by $2^m$, so by induction the label created by the $i$-th insertion is divisible by at least $2^{m-i}$, thus guaranteeing capacity $\Omega(m)$.
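A minimal sketch of this straight-forward labeling in Python (the names are ours, purely illustrative):

\begin{verbatim}
M = 16                                    # in the worst case about M insertions fit

def label_between(pred: int, succ: int) -> int:
    # labels live between the sentinels 0 and 2**M; a new item inserted between
    # labels pred and succ receives their arithmetic mean
    assert succ - pred >= 2, "labels exhausted: this gap cannot be halved any further"
    return (pred + succ) // 2

first = label_between(0, 2 ** M)          # 2**(M-1), divisible by 2**(M-1)
second = label_between(first, 2 ** M)     # 3 * 2**(M-2), divisible by 2**(M-2)
\end{verbatim}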
Let us denote the total number of updates to the persistent tree by $n$.
It would be reasonable to assume that arithmetic operations on non-negative integers less than or equal to $n$ can be done in constant time.
We are therefore permitted to create only $\O(\log n)$ versions using the straight-forward idea if constant time per operation is needed. This is not sufficient.
To preserve the speed of semi-persistence, {\tt Compare} must take $\O(1)$ and {\tt InsertSuccessor} $\O(\log n)$ time, at least amortized.
Weight-balanced trees that were introduced in the previous chapter are ideally suited for this purpose.
Every vertex in the tree will correspond to one version.
For every vertex in the weight-balanced tree, we will store an encoding of the path from the root to it as a sequence of zeros and ones:
a one for going to a right child and a zero for going to a left child.
The distance from the root is also stored in every vertex.
We know that the height of the tree must be logarithmic, so each encoding consists of $\O(\log n)$ bits and can be interpreted as an integer in binary notation.
Therefore we can store these encodings as integers.
Comparison of versions then translates to comparison of these integers.
Suppose we compare two vertices $u$ and $v$ with distances from the root $d_u$ and $d_v$ and paths encoded as integers $e_u$, $e_v$.
We only want to compare the first $d$ bits, where $d = \min(d_u, d_v)$.
In order to compare only the first $d$ bits of the encodings, one of $e_u$, $e_v$ can be divided by an appropriate power of two.
If one of the compared numbers is greater, the corresponding vertex clearly comes later in the ordering.
In case of equality, either $u = v$, or one is an ancestor of the other.
We can decide between the two options by comparing the depths: $u = v$ if and only if $d_u = d_v$.
If the latter is true and, without loss of generality, $u$ is an ancestor of $v$, the bit at position $d+1$ in the encoding of $v$ determines whether $v$ lies in the left or the right subtree of $u$.
We have demonstrated that the comparison can be made in $\O(1)$ time.
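The whole comparison can be sketched as follows (in Python; the names are ours, and the first step from the root is taken as the most significant bit of the encoding).

\begin{verbatim}
def compare(e_u: int, d_u: int, e_v: int, d_v: int) -> int:
    """Return 1 if u precedes v in the ordering, -1 if it succeeds v, 0 if u = v."""
    d = min(d_u, d_v)
    pu = e_u >> (d_u - d)                 # keep only the first d bits of each path
    pv = e_v >> (d_v - d)
    if pu != pv:
        return 1 if pu < pv else -1       # paths diverge: going left means earlier
    if d_u == d_v:
        return 0                          # same depth and same path: the same vertex
    if d_u > d_v:                         # v is an ancestor of u
        bit = (e_u >> (d_u - d - 1)) & 1  # the (d+1)-st bit of u's path
        return 1 if bit == 0 else -1      # left subtree of v precedes v, right one succeeds it
    bit = (e_v >> (d_v - d - 1)) & 1      # symmetric case: u is an ancestor of v
    return -1 if bit == 0 else 1

# the root (e = 0, d = 0) versus a vertex with path 01 (e = 1, d = 2):
assert compare(0, 0, 0b01, 2) == -1       # the descendant in the left subtree comes first
\end{verbatim}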
Inserting a successor version is also simple.
Rebuilding a subtree will not change the order of versions, as all path encodings within it will be recalculated.
Integer arithmetic can be used to efficiently update encodings of paths to the root.
Our use of list ordering will involve one more trick.
Suppose we have some versions $a$, $b$, $c$ in that order and a fat vertex with slots for $a$ and $c$.
When we insert a successor $a'$ to version $a$ and a slot for that version, we would inadvertently also modify the state of the vertex for version $b$.
Therefore, we also need to undo the changes by creating $a''$, a successor to $a'$, and inserting a slot for $a''$ directly after the slot for $a'$.
This would result in a fat vertex with slots for $a$, $a'$, $a''$, and $c$.
Before moving to the next section, we remark that Tsakalidis found a method to get $\O(1)$ amortized complexity for insert and delete with weight-balanced trees via indirection.