Commit 7e7a5f7c authored by Jiri Skrobanek

Add OFM

parent 685937f2
@@ -73,15 +73,15 @@ It will be addressed later through introduction of total ordering of all version.
Looking at memory consumption, it is clear that the amount consumed stems from the total number of changes made by the balancing algorithm across all operations.
What remains to be solved, though, is the choice of the right kind of collection for the versions of changes of one vertex. Using linked lists or arrays will inevitably lead to unsatisfactorily inefficient lookup of applicable changes. Unfortunately, as it turns out, other data structures will also let us down here and the complexity will increase.
With this approach, we can reach space complexity linear in the total number of changes in the tree. On the other hand, execution time will suffer. If the tree has $m$ versions, the time spent at each vertex increases to $\O(\log m)$ when the collections inside vertices are implemented as binary search trees. This can be improved by using more suitable data structures, e.g. van Emde Boas trees.
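To illustrate this cost, the following Python sketch (with illustrative names) keeps all versions of a single field of a vertex in a list sorted by version number and finds the value applicable in a given version by a predecessor search; the sorted list and {\tt bisect} merely stand in for the search in a balanced tree, the point being that identifying the applicable change costs $\O(\log m)$ at every visited vertex.

\begtt
from bisect import bisect_right

class VersionedField:
    # all versions of one field of a vertex (e.g. its left-child pointer),
    # kept sorted by version number
    def __init__(self):
        self.versions = []                   # increasing version numbers
        self.values = []                     # values[i] was written in versions[i]

    def set(self, version, value):
        # in semi-persistence updates arrive in increasing version order
        self.versions.append(version)
        self.values.append(value)

    def get(self, version):
        # value valid in `version`: the latest change with version <= `version`
        i = bisect_right(self.versions, version)
        return self.values[i - 1] if i else None

left = VersionedField()
left.set(3, "u")
left.set(7, "w")
assert left.get(5) == "u" and left.get(9) == "w"
\endtt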
\section{Pointer Data Structures and Semi-Persistence}
Following up on the idea of fat vertices, let us limit their size to achieve identification of the applicable version in constant time.
We will explain this technique for semi-persistence first.
Full persistence is more complicated and requires a few extra tricks, including an ordering on the versions.
A fat vertex stores a dictionary of standard vertices indexed by versions.
We call the values of this dictionary \em{slots}.
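As a concrete, simplified picture, the following Python sketch (the names and the constant bound are illustrative, not taken from the text) stores at most a constant number of slots per fat vertex, so the slot applicable to a given version is found by inspecting $\O(1)$ entries; what happens when a vertex runs out of slots is only indicated by a comment, since copying vertices is where the real care of the technique lies.

\begtt
# A simplified fat vertex with a constant bound on the number of slots.
MAX_SLOTS = 4       # any constant works; the text only requires it to be constant

class Slot:
    # an ordinary vertex of the search tree, valid from some version on
    def __init__(self, key, value, left, right):
        self.key, self.value, self.left, self.right = key, value, left, right

class FatVertex:
    def __init__(self):
        self.slots = {}                      # version -> Slot, at most MAX_SLOTS entries

    def slot_for(self, version):
        # the applicable slot is the newest one not younger than `version`;
        # with O(1) slots this inspects only a constant number of entries
        best = None
        for v in self.slots:
            if v <= version and (best is None or v > best):
                best = v
        return self.slots.get(best)

    def write(self, version, slot):
        if len(self.slots) >= MAX_SLOTS and version not in self.slots:
            # overflow: the usual remedy (not implemented in this sketch) is to
            # copy the vertex with only its newest slot and record the copy
            # in the parent's slots
            raise NotImplementedError("vertex must be copied")
        self.slots[version] = slot
\endtt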
@@ -158,9 +158,20 @@ The number of vertices will be denoted $n$, the number of edges is thus bounded
\section{A Sketch of Full Persistence}
It is possible to obtain fully persistent binary search trees with $\O(\log n)$ time complexity per operation, where $n$ is the total number of updates. A couple of obstacles must be overcome, however.
\list{o}
\: First of all, we need to decide how to represent the relationships between versions. This is best done via a rooted tree with ordered children. The root of the tree represents the initial empty tree. When an update on version $v$ creates a new version, it is inserted into this tree as a child of $v$. We would like to introduce a dynamic linear order on all versions which respects the structure of this version tree. Of course, comparison of versions must be efficient, ideally taking $\O(1)$, and insertion must take at most $\O(\log m)$. This problem is called \em{list ordering} and we address it in the next section.
\: Another obstacle arises from the worst-case complexity of updates. Imagine an update on the structure in version $v$ taking $\omega(\log n)$ time or modifying $\omega(1)$ nodes. Since this update can hypothetically be repeated an unlimited number of times, amortization arguments no longer help, and we need an underlying tree whose updates take $\O(\log n)$ worst-case time and make only $\O(1)$ changes to the tree.
Few binary search trees possess this property and those that do are typically very complicated. However, some of the common balancing algorithms for binary search trees, such as red-black trees or weak-AVL trees, can be altered so that each update makes only $\O(1)$ changes to the tree.
\: In semi-persistence it was not necessary to store all information about a node: for older versions it was sufficient to store the pointers to children, the key, and the value, while the fields used for balancing were not needed. This is not true for full persistence; all information must be stored.
\section{List Ordering}
Moving from semi-persistence to full persistence we encounter an obstacle -- versions no longer form an implicit linear order. (By versions we mean the states of the tree between updates. We will also use some auxiliary versions not directly mappable to any such state.) Nonetheless, to work with fat vertices, we need to be able to determine which slots carry the values correct for the current version. To achieve this, we need to identify the interval of versions the current version falls into. For this purpose we will introduce an ordering on the versions of the persistent data structure.
Versions do form a rooted tree with the original version at the root. We can also order the children of every vertex by the time of their creation (latest first). We then define the desired ordering as the order in which the vertices corresponding to the versions are visited by an in-order traversal.
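As a small illustration, the following Python sketch (with illustrative names) builds the version tree and produces the resulting linear order; since an in-order traversal is not uniquely defined for a tree with arbitrarily many children, we read it here as visiting a version before the subtrees of its children, newest child first. Under this reading a newly created version lands immediately after its parent in the order, so the relative order of existing versions never changes.

\begtt
class Version:
    def __init__(self, parent=None):
        self.children = []                   # ordered by creation, latest first
        if parent is not None:
            parent.children.insert(0, self)

def linear_order(version):
    # visit a version, then the subtrees of its children from newest to oldest
    yield version
    for child in version.children:
        yield from linear_order(child)

root = Version()                             # the initial empty tree
a = Version(root)                            # an update on root creates version a
b = Version(root)                            # another update on root creates b
c = Version(a)                               # an update on a creates c
assert list(linear_order(root)) == [root, b, a, c]
\endtt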
@@ -183,4 +194,51 @@ To preserve the speed of semi-persistence, {\tt Compare} must be $\O(1)$ and {\t
For every vertex in the weight-balanced tree, we will store an encoding of the path from the root to it as a sequence of 0s and 1s: 1 for a right child, 0 for a left child. We know that the depth of the tree is logarithmic, so the compared integers consist of $\O(\log n)$ bits and comparison of versions is efficient. Inserting a successor version is also simple. Rebuilding a subtree does not change the order of versions, as all path encodings within it are recalculated.
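The text leaves open how two paths of different lengths are compared as integers; one standard possibility (an assumption here, not necessarily the intended encoding) is to append a 1 bit to the path and pad with zeros on the right to a common width, after which ordinary integer comparison agrees with the in-order position. A small Python sketch:

\begtt
def encode(path, width):
    # path: string of '0' (left) and '1' (right) steps from the root of the
    # weight-balanced tree; width must exceed the depth of the tree
    bits = path + "1"                        # the vertex sits between its subtrees
    return int(bits.ljust(width, "0"), 2)

W = 8                                        # O(log n) bits suffice
root_v = encode("", W)                       # the root of the tree
left_v = encode("0", W)                      # its left child
lr_v   = encode("01", W)                     # right child of the left child
assert left_v < lr_v < root_v                # agrees with the in-order traversal
\endtt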
\section{Ordered File Maintenance}
In this section we will address the list-ordering problem in a somewhat different fashion.
We want to store the $n$ items of our ordered list in one array of size $\O(n)$ -- we allow interspersed empty records.
This constraint forces us to abandon the method of using paths in a tree as keys.
On the other hand, in a sorted array, constant-time comparison is simple: we just compare array indices.
As already mentioned, our structure at its core is an array.
Conceptually, however, we think of it as a complete binary tree.
We also imagine that indirection is used on blocks of size roughly $\O(\log n)$, and these blocks are the leaves of our conceptual tree.
Every level of our conceptual tree has a prescribed density of records.
Precisely, the density of a vertex is the ratio of the number of occupied records in the blocks that are leaves of its subtree to the total number of records in those blocks.
When we have a vertex $v$ at distance $i$ from the root, its density must fall into the interval $[1/2 - i/(4h), 3/4 + i/(4h)]$, where $h$ is the height of the conceptual tree.
This interval is called \em{standard density}.
We can observe that the density constraint is stricter for nodes closer to the root.
Standard density will be maintained for all nodes in the tree.
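For concreteness, substituting the extreme depths into this interval: the root ($i = 0$) must keep its density within $[1/2, 3/4]$, while the leaf blocks ($i = h$) are only required to stay within $[1/4, 1]$.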
Insertion will work in two steps.
First, the new entry is inserted into the correct block.
Then we check the density of that block.
If the density is nonstandard, we proceed towards the root of the conceptual tree until a vertex with standard density is found.
When we reach a node with standard density, the records in its subtree are redistributed evenly among its blocks.
This step also restores standard density for all other vertices inside the subtree.
If even the root has nonstandard density (higher than standard, of course), the size of the array is increased by a constant factor and all records are distributed evenly inside it.
In this case a new block size is chosen and a new conceptual tree is considered.
Deletion is very similar to insertion, with the difference that when the root is found to have nonstandard density, the array is shrunk by a constant factor instead of expanded.
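To make the procedure concrete, the following Python sketch (all names, constants, and the fixed block size are illustrative simplifications) implements insertion as just described: the new entry goes to the block of its predecessor, densities are checked on the way towards the root, and the first subtree with standard density is redistributed evenly; if even the root is too dense, the array is doubled. Deletion, the rescaling of the block size, and an efficient predecessor search are omitted.

\begtt
class OFM:
    # Ordered-file-maintenance array: items kept in sorted order in an array
    # with interspersed empty slots (None).  The block size is fixed here,
    # whereas the text rescales it when the array grows.
    def __init__(self, block=4, leaves=4):
        self.block = block                   # leaf block size, conceptually ~ log n
        self.a = [None] * (block * leaves)   # blocks are leaves of the conceptual tree

    def _h(self):
        # height of the conceptual complete binary tree above the blocks
        return max(1, (len(self.a) // self.block - 1).bit_length())

    def _max_density(self, depth):
        # upper end of the standard interval [1/2 - d/(4h), 3/4 + d/(4h)]
        return 0.75 + depth / (4 * self._h())

    def _items(self, lo, hi):
        return [v for v in self.a[lo:hi] if v is not None]

    def _spread(self, items, lo, hi):
        # distribute items evenly over a[lo:hi]; this also restores the
        # density of every conceptual vertex inside the subtree
        self.a[lo:hi] = [None] * (hi - lo)
        for i, v in enumerate(items):
            self.a[lo + i * (hi - lo) // len(items)] = v

    def insert(self, x):
        # step 1: the new entry belongs to the block of its predecessor
        pos = 0
        for i, v in enumerate(self.a):
            if v is not None and v <= x:
                pos = i
        lo = pos - pos % self.block
        hi, depth = lo + self.block, self._h()
        # step 2: walk towards the root until a subtree keeps standard
        # density even with the new item, then redistribute it evenly
        while True:
            items = sorted(self._items(lo, hi) + [x])
            if len(items) / (hi - lo) <= self._max_density(depth):
                self._spread(items, lo, hi)
                return
            if (lo, hi) == (0, len(self.a)):
                break                        # even the root is too dense
            span = 2 * (hi - lo)             # move to the parent subtree
            lo = lo - lo % span
            hi, depth = lo + span, depth - 1
        # step 3: grow the array by a constant factor and spread everything
        items = sorted(self._items(0, len(self.a)) + [x])
        self.a = [None] * (2 * len(self.a))
        self._spread(items, 0, len(self.a))
\endtt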
Now that we have described the algorithm, let us find out what the complexity is.
\theorem{
A sequence of $n$ inserts and deletes on an empty OFM data structure takes $\O(n \log^2 n)$ time.
}
\proof
For simplicity we only analyze insert.
Assume that redistribution is needed at level $i$ during insert.
That means in particular that one child has exceeded its standard density bound of ${3 \over 4} + {i+1 \over 4h}$ while this vertex has density at most ${3 \over 4} + {i \over 4h}$.
This implies that since the last redistribution, at least a ${1 \over 4h}$ fraction of the records in this subtree have been inserted.
Redistribution of records takes time linear in the size of the subtree.
This means that to pay for this redistribution, it suffices for each of those inserted records to pay $\O(h) = \O(\log n)$.
Of course this must be paid at every level of our conceptual tree, bringing the total per insert to $\O(\log^2 n)$.
Similarly, if an expansion of the array is needed, its cost can be amortized by charging a constant to every insert.
\qed
\endchapter