(a,b)-trees: init

Commit 67e39491 authored 6 years ago by Martin Mareš
--- a/03-abtree/Makefile
+++ b/03-abtree/Makefile
+TOP=..
+PICS=ab-example
+
+include ../Makerules
--- a/03-abtree/ab-example.asy
+++ b/03-abtree/ab-example.asy
+import ads;
+import trees;
+
+/* First tree */
+
+pair u[];
+real s = 1;
+u[0] = (0, 0);
+u[1] = u[0] + (-1.2, -s);
+u[2] = u[0] + (0, -s);
+u[3] = u[0] + (1.2, -s);
+u[4] = u[1] + (-0.4, -s);
+u[5] = u[1] + (0, -s);
+u[6] = u[1] + (0.4, -s);
+u[7] = u[2] + (-0.3, -s);
+u[8] = u[2] + (0.3, -s);
+u[9] = u[3] + (-0.3, -s);
+u[10] = u[3] + (0.3, -s);
+
+tree_init(u);
+
+real d = 0.1;
+real dd = 0.18;
+ab_edge(0, 1, -dd);
+ab_edge(0, 2);
+ab_edge(0, 3, dd);
+ab_edge(1, 4, -dd);
+ab_edge(1, 5);
+ab_edge(1, 6, dd);
+ab_edge(2, 7, -d);
+ab_edge(2, 8, d);
+ab_edge(3, 9, -d);
+ab_edge(3, 10, d);
+
+tree_elliptic_node(0, "4\;7");
+tree_elliptic_node(1, "1\;3");
+tree_elliptic_node(2, "6");
+tree_elliptic_node(3, "9");
+for (int i=4; i<=10; ++i)
+	tree_ext(i);
+
+/* Second tree */
+
+pair v[];
+real s = 1;
+v[0] = (4.5, 0);
+v[1] = v[0] + (-1.2, -s);
+v[2] = v[0] + (0, -s);
+v[3] = v[0] + (1.2, -s);
+v[4] = v[1] + (-0.3, -s);
+v[5] = v[1] + (0.3, -s);
+v[6] = v[2] + (-0.3, -s);
+v[7] = v[2] + (0.3, -s);
+v[8] = v[3] + (-0.4, -s);
+v[9] = v[3] + (0, -s);
+v[10] = v[3] + (0.4, -s);
+
+tree_init(v);
+
+real d = 0.1;
+real dd = 0.18;
+ab_edge(0, 1, -dd);
+ab_edge(0, 2);
+ab_edge(0, 3, dd);
+ab_edge(1, 4, -d);
+ab_edge(1, 5, d);
+ab_edge(2, 6, -d);
+ab_edge(2, 7, d);
+ab_edge(3, 8, -dd);
+ab_edge(3, 9);
+ab_edge(3, 10, dd);
+
+tree_elliptic_node(0, "3\;6");
+tree_elliptic_node(1, "1");
+tree_elliptic_node(2, "4");
+tree_elliptic_node(3, "7\;9");
+for (int i=4; i<=10; ++i)
+	tree_ext(i);
--- a/03-abtree/abtree.tex
+++ b/03-abtree/abtree.tex
+\ifx\chapter\undefined
+\input adsmac.tex
+\singlechapter{3}
+\fi
+
+\chapter[abtree]{(a,b)-trees}
+
+In this chapter, we will study an extension of binary search trees.
+It will store multiple keys per node, so the nodes will have more than
+two children. The structure of such trees will be more complex, but we
+will gain much more straightforward balancing operations.
+
+\section{Worst-case bounds}
+
+\defn{
+A~\df{multi-way search tree} is a~rooted tree with specified order
+of children in every node. Nodes are divided into internal and external.
+
+Each~\em{internal node} contains one or more distinct keys, stored in
+increasing order. A~node with keys $x_1 < \ldots < x_k$ has $k+1$ children
+$s_0,\ldots,s_k$. The keys in the node separate keys in the corresponding
+subtrees. More formally, if we extend the keys by sentinels $x_0=-\infty$ and
+$x_{k+1}=+\infty$, every key~$y$ in the subtree $T(s_i)$ satisfies $x_i < y < x_{i+1}$.
+
+\em{External nodes} carry no data and they have no children. These are the
+leaves of the tree. In a~program, we can represent them by null pointers.
+}
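+
+As a~small illustration (the node and its keys are chosen here only as an example),
+consider an internal node with keys $x_1=4$ and $x_2=7$. With the sentinels,
+we have $x_0=-\infty$ and $x_3=+\infty$, so the keys~$y$ in its three subtrees satisfy
+$$
+	T(s_0)\colon\ -\infty < y < 4, \qquad
+	T(s_1)\colon\ 4 < y < 7, \qquad
+	T(s_2)\colon\ 7 < y < +\infty.
+$$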
+
+\obs{
+Searching in multi-way trees is similar to binary trees. We start at the root.
+In each node, we compare the desired key with all keys of the node. Then we either
+finish or continue in a~uniquely determined subtree. When we reach an external node,
+we conclude that the requested key is not present.
+
+The universe is split into open intervals by the keys of the tree. There is a~1-to-1
+correspondence between these intervals and external nodes of the tree. This means
+that searches for all keys from the same interval end in the same external node.
+}
+
+Like binary search trees, multi-way trees can become degenerate. We therefore need
+to add further invariants to keep the trees balanced.
+
+\defn{
+An~\em{(a,b)-tree} for parameters $a\ge 2$ and $b\ge 2a-1$ is a~multi-way
+search tree, which satisfies:
+\tightlist{n.}
+\:The root has between 2 and~$b$ children (unless it is a~leaf, which happens
+  when the set of keys is empty). Every other internal node
+  has between $a$ and~$b$ children.
+\:All external nodes have the same depth.
+\endlist
+}
+
+The requirements on $a$ and~$b$ may seem mysterious, but they will become
+clear when we define operations.
+The smallest $(a,b)$-trees allowed by these requirements are $(2,3)$-trees; see
+figure~\figref{abexample} for examples. Internal nodes are drawn round,
+external nodes are square.
+
+\figure[abexample]{ab-example.pdf}{}{Two $(2,3)$-trees for the same set of keys}
+
+Unlike in other texts, we will \em{not} assume that $a$ and~$b$ are constants,
+which can be hidden in the~$\O$. This will help us to examine how the performance
+of the structure depends on these parameters.
+
+We will start with bounding the height of $(a,b)$-trees. Height will be measured
+in edges: a~tree of height~$h$ has the root at depth~0 and external nodes at depth~$h$.
+We will call the set of nodes at depth~$i$ the $i$-th \em{level.}
+
+\lemma{
+The height of an~$(a,b)$-tree with $n$~keys lies between
+$\log_b(n+1)$ and $1 + \log_a((n+1)/2)$.
+}
+
+\proof
+Let us start with the upper bound. We will calculate the minimum number of keys
+in a~tree of height~$h$. The minimum will be attained by a~tree where each node
+contains the minimum possible number of keys, so it has the minimum possible number
+of children. Level~0 contains only the root with 1~key. Level~$h$ contains only
+external nodes. The $i$-th level in between contains $2a^{i-1}$ nodes with $a-1$ keys
+each. Summing over levels, we get:
+$$
+	    1 + \sum_{i=1}^{h-1} 2a^{i-1}(a-1)
+	  = 1 + 2(a-1) \sum_{i=0}^{h-2} a^i
+	  = 1 + 2(a-1) {(a^{h-1}-1)\over (a-1)}
+	  = 2a^{h-1} - 1.
+$$
+So $n \ge 2a^{h-1} - 1$ in any tree of height~$h$. Solving this for~$h$ yields
+$h\le 1 + \log_a((n+1)/2)$.
+
+For the lower bound, we consider the maximum number of keys for height~$h$.
+All nodes will contain the highest possible number of keys. Thus we will have
+$b^i$~nodes at level~$i$, each with $b-1$ keys. In total, the number of keys will reach
+$$
+	  \sum_{i=0}^{h-1} b^i(b-1)
+	= (b-1) \sum_{i=0}^{h-1} b^i
+	= (b-1) \cdot {b^h-1 \over b-1}
+	= b^h - 1.
+$$
+Therefore in each tree, we have $n \le b^h-1$, so $h \ge \log_b (n+1)$.
+\qed
+
+\corr{The height is $\Omega(\log_b n)$ and $\O(\log_a n)$.}
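+
+As a~quick sanity check of the lemma (a~worked example using the trees of
+figure~\figref{abexample}), take $a=2$, $b=3$, and $n=6$. The bounds give
+$$
+	\log_3 7 \approx 1.77 \le h \le 1 + \log_2 (7/2) \approx 2.81,
+$$
+so the only possible height is $h=2$, which agrees with both trees in the figure.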
+
+\subsection{Searching for a~key}
+
+$\alg{Find}(x)$ follows the general algorithm for multi-way trees.
+It visits $\O(\log_a n) = \O(\log n/\log a)$ nodes. In each node, it compares~$x$ with
+all keys of the node, which can be performed in time $\O(\log b)$ by binary search.
+In total, we spend time $\Theta(\log n \cdot \log b / \log a)$.
+
+If $b$ is~polynomial in~$a$, the ratio of logarithms is $\Theta(1)$, so the complexity
+of \alg{Find} is $\Theta(\log n)$. This is optimal since each non-final comparison brings
+at most 1~bit of information and we need to gather $\log n$ bits to determine the result.
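+
+To spell out the first step (a~short derivation; the exponent~$c$ is a~name
+introduced here for illustration), suppose that $b\le a^c$ for some constant $c\ge 1$.
+Since the definition also guarantees $b\ge a$, we get
+$$
+	1 \le {\log b \over \log a} \le {\log a^c \over \log a} = c,
+$$
+so $\log n \le \log n \cdot \log b / \log a \le c\log n$, which is $\Theta(\log n)$.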
+
+\subsection{Insertion}
+
+TODO
+
+An~\alg{Insert} takes $\Theta(b \cdot \log n / \log a)$ time.
+
+\subsection{Deletion}
+
+TODO
+
+A~\alg{Delete} takes $\Theta(b \cdot \log n / \log a)$ time.
+
+\subsection{The choice of parameters}
+
+For a~fixed~$a$, the complexity of all three operations increases with~$b$,
+so we should set $b$~small. The usual choices are the minimum value $2a-1$
+allowed by the definition or $2a$. As we will see in the following sections,
+the latter value has several advantages.
+
+If $b\in\Theta(a)$, the complexity of \alg{Find} becomes $\Theta(\log n)$.
+Both \alg{Insert} and \alg{Delete} will run in time $\Theta(\log n \cdot (a/\log a))$.
+We can therefore conclude that we want to set $a$ to~2 or possibly to another small
+constant. The best choices are the $(2,3)$-tree and the $(2,4)$-tree.
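+
+To see how quickly the factor $a/\log a$ grows, here are a~few values
+(computed with binary logarithms, purely for illustration):
+$$
+	{2\over\log_2 2} = 2, \qquad
+	{4\over\log_2 4} = 2, \qquad
+	{16\over\log_2 16} = 4, \qquad
+	{256\over\log_2 256} = 32.
+$$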
+
+This is true on the RAM, or on any machine with ideal random-access memory.
+If we store the tree on a~disk, the situation changes. A~disk is divided into blocks
+(the typical size of a~block is on the order of kilobytes) and I/O is performed on
+whole blocks. Reading a~single byte is therefore as expensive as reading the full
+block. In this situation, we will set~$a$ to make the size of a~node match the size
+of the block. Consider an example with $4\,{\rm KB}$ blocks, 32-bit keys, and 32-bit pointers.
+If we use the $(256,511)$-tree, one node will fit in a~block and the tree will be
+very shallow: four levels suffice for storing more than 33~million keys. Furthermore,
+the last level contains only external nodes and we can keep the root cached in memory,
+so each search will read only 2~blocks.
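+
+The numbers in this example can be verified as follows (a~worked check which
+assumes that a~node stores nothing but its keys and child pointers). A~full
+node of the $(256,511)$-tree holds 510 keys and 511 pointers, which takes
+$$
+	(510 + 511) \cdot 4\,{\rm B} = 4084\,{\rm B} \le 4096\,{\rm B}.
+$$
+By the proof of the height lemma, a~tree of height~4 has at least
+$2\cdot 256^3 - 1 = 33\,554\,431$ keys, so every tree with fewer keys has
+height at most~3, that is, four levels including the external one.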
+
+The same principle applies to the main memory of contemporary computers, too.
+They usually employ a~fast \em{cache} between the processor and the (much slower)
+main memory. The communication between the cache and the memory involves transfer
+of \em{cache lines} --- blocks of typical size around $64\,{\rm B}$. It therefore
+helps if we match the size of nodes with the size of cache lines and we align the start
+of nodes to a~multiple of cache line size. For 32-bit keys and 32-bit pointers,
+we can use a~$(4,7)$-tree.
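+
+Again assuming that a~node contains only keys and pointers, a~full node of
+the $(4,7)$-tree holds 6~keys and 7~pointers, so it occupies
+$$
+	(6 + 7)\cdot 4\,{\rm B} = 52\,{\rm B} \le 64\,{\rm B}.
+$$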
+
+\subsection{Other versions}
+
+Our definition of $(a,b)$-trees is not the only one used. Some authors prefer
+to store useful keys only on the last level and let the other internal nodes
+contain copies of these keys (typically minima of subtrees) for easy
+navigation. This requires minor modifications to our procedures, but the
+asymptotics stay the same. This version can be useful in databases, which usually
+associate potentially large data with each key. They have two different formats
+of nodes, possibly with different choices of $a$ and~$b$: leaves, which contain
+keys with associated data, and internal nodes with keys and pointers.
+
+Database theorists often prefer the name \em{B-trees.} There are many definitions
+of B-trees in the wild, but they are usually equivalent to $(a,2a-1)$-trees or
+$(a,2a)$-trees, possibly modified as in the previous paragraph.
+
+\section{Amortized analysis}
+
+TODO
+
+\theorem{A~sequence of $m$~\alg{Insert}s on an~initially empty $(a,b)$-tree
+performs $\O(m)$ node modifications.}
+
+\theorem{A~sequence of $m$~\alg{Insert}s and \alg{Delete}s on an~initially empty $(a,2a)$-tree
+performs $\O(m)$ node modifications.}
+
+\subsection{A-sort}
+
+TODO
+
+\section{Top-down (a,b)-trees and parallel access}
+
+\section{Red-black trees}
+
+TODO
+
+\endchapter