diff --git a/03-abtree/Makefile b/03-abtree/Makefile new file mode 100644 index 0000000000000000000000000000000000000000..a725e5445366e4fcf850c204c98c54bcf634ff83 --- /dev/null +++ b/03-abtree/Makefile @@ -0,0 +1,4 @@ +TOP=.. +PICS=ab-example + +include ../Makerules diff --git a/03-abtree/ab-example.asy b/03-abtree/ab-example.asy new file mode 100644 index 0000000000000000000000000000000000000000..df3ac055a689cb9d7564971885bc48746526ece1 --- /dev/null +++ b/03-abtree/ab-example.asy @@ -0,0 +1,78 @@ +import ads; +import trees; + +/* First tree */ + +pair u[]; +real s = 1; +u[0] = (0, 0); +u[1] = u[0] + (-1.2, -s); +u[2] = u[0] + (0, -s); +u[3] = u[0] + (1.2, -s); +u[4] = u[1] + (-0.4, -s); +u[5] = u[1] + (0, -s); +u[6] = u[1] + (0.4, -s); +u[7] = u[2] + (-0.3, -s); +u[8] = u[2] + (0.3, -s); +u[9] = u[3] + (-0.3, -s); +u[10] = u[3] + (0.3, -s); + +tree_init(u); + +real d = 0.1; +real dd = 0.18; +ab_edge(0, 1, -dd); +ab_edge(0, 2); +ab_edge(0, 3, dd); +ab_edge(1, 4, -dd); +ab_edge(1, 5); +ab_edge(1, 6, dd); +ab_edge(2, 7, -d); +ab_edge(2, 8, d); +ab_edge(3, 9, -d); +ab_edge(3, 10, d); + +tree_elliptic_node(0, "4\;7"); +tree_elliptic_node(1, "1\;3"); +tree_elliptic_node(2, "6"); +tree_elliptic_node(3, "9"); +for (int i=4; i<=10; ++i) + tree_ext(i); + +/* Second tree */ + +pair v[]; +real s = 1; +v[0] = (4.5, 0); +v[1] = v[0] + (-1.2, -s); +v[2] = v[0] + (0, -s); +v[3] = v[0] + (1.2, -s); +v[4] = v[1] + (-0.3, -s); +v[5] = v[1] + (0.3, -s); +v[6] = v[2] + (-0.3, -s); +v[7] = v[2] + (0.3, -s); +v[8] = v[3] + (-0.4, -s); +v[9] = v[3] + (0, -s); +v[10] = v[3] + (0.4, -s); + +tree_init(v); + +real d = 0.1; +real dd = 0.18; +ab_edge(0, 1, -dd); +ab_edge(0, 2); +ab_edge(0, 3, dd); +ab_edge(1, 4, -d); +ab_edge(1, 5, d); +ab_edge(2, 6, -d); +ab_edge(2, 7, d); +ab_edge(3, 8, -dd); +ab_edge(3, 9 ); +ab_edge(3, 10, dd); + +tree_elliptic_node(0, "3\;6"); +tree_elliptic_node(1, "1"); +tree_elliptic_node(2, "4"); +tree_elliptic_node(3, "7\;9"); +for (int i=4; i<=10; ++i) + 
tree_ext(i); diff --git a/03-abtree/abtree.tex b/03-abtree/abtree.tex new file mode 100644 index 0000000000000000000000000000000000000000..b96fa4691483e63433a45f742606c30c1dc802c3 --- /dev/null +++ b/03-abtree/abtree.tex @@ -0,0 +1,194 @@ +\ifx\chapter\undefined +\input adsmac.tex +\singlechapter{3} +\fi + +\chapter[abtree]{(a,b)-trees} + +In this chapter, we will study an extension of binary search trees. +It will store multiple keys per node, so the nodes will have more than +two children. The structure of such trees will be more complex, but we +will gain much more straightforward balancing operations. + +\section{Worst-case bounds} + +\defn{ +A~\df{multi-way search tree} is a~rooted tree with a~specified order +of children in every node. Nodes are divided into internal and external. + +Each~\em{internal node} contains one or more distinct keys, stored in +increasing order. A~node with keys $x_1 < \ldots < x_k$ has $k+1$ children +$s_0,\ldots,s_k$. The keys in the node separate keys in the corresponding +subtrees. More formally, if we extend the keys by sentinels $x_0=-\infty$ and +$x_{k+1}=+\infty$, every key~$y$ in the subtree $T(s_i)$ satisfies $x_i < y < x_{i+1}$. + +\em{External nodes} carry no data and they have no children. These are the +leaves of the tree. In a~program, we can represent them by null pointers. +} + +\obs{ +Searching in multi-way trees is similar to binary trees. We start at the root. +In each node, we compare the desired key with all keys of the node. Then we either +finish or continue in a~uniquely determined subtree. When we reach an external node, +we conclude that the requested key is not present. + +The universe is split into open intervals by the keys of the tree. There is a~1-to-1 +correspondence between these intervals and external nodes of the tree. This means +that searches for all keys from the same interval end in the same external node. +} + +As in binary search trees, multi-way trees can become degenerate. 
We therefore need +to add further invariants to keep the trees balanced. + +\defn{ +An~\em{(a,b)-tree} for parameters $a\ge 2$ and $b\ge 2a-1$ is a~multi-way +search tree that satisfies: +\tightlist{n.} +\:The root has between 2 and~$b$ children (unless it is a~leaf, which happens + when the set of keys is empty). Every other internal node + has between $a$ and~$b$ children. +\:All external nodes have the same depth. +\endlist +} + +The requirements on $a$ and~$b$ may seem mysterious, but they will become +clear when we define operations. +The smallest $(a,b)$-trees allowed by the definition are $(2,3)$-trees --- see +figure~\figref{abexample} for examples. Internal nodes are drawn round, +external nodes are square. + +\figure[abexample]{ab-example.pdf}{}{Two $(2,3)$-trees for the same set of keys} + +Unlike in other texts, we will \em{not} assume that $a$ and~$b$ are constants, +which can be hidden in the~$\O$. This will help us to examine how the performance +of the structure depends on these parameters. + +We will start with bounding the height of $(a,b)$-trees. Height will be measured +in edges: a~tree of height~$h$ has the root at depth~0 and external nodes at depth~$h$. +We will call the set of nodes at depth~$i$ the $i$-th \em{level.} + +\lemma{ +The height of an~$(a,b)$-tree with $n$~keys lies between +$\log_b(n+1)$ and $1 + \log_a((n+1)/2)$. +} + +\proof +Let us start with the upper bound. We will calculate the minimum number of keys +in a~tree of height~$h$. The minimum will be attained by a~tree where each node +contains the minimum possible number of keys, so it has the minimum possible number +of children. Level~0 contains only the root with 1~key. Level~$h$ contains only +external nodes. The $i$-th level in between contains $2a^{i-1}$ nodes with $a-1$ keys +each. Summing over levels, we get: +$$ + 1 + \sum_{i=1}^{h-1} 2a^{i-1}(a-1) + = 1 + 2(a-1) \sum_{i=0}^{h-2} a^i + = 1 + 2(a-1) {(a^{h-1}-1)\over (a-1)} + = 2a^{h-1} - 1. 
+$$ +So $n \ge 2a^{h-1} - 1$ in any tree of height~$h$. Solving this for~$h$ yields +$h\le 1 + \log_a((n+1)/2)$. + +For the lower bound, we consider the maximum number of keys for height~$h$. +All nodes will contain the highest possible number of keys. Thus we will have +$b^i$~nodes at level~$i$, each with $b-1$ keys. In total, the number of keys will reach +$$ + \sum_{i=0}^{h-1} b^i(b-1) + = (b-1) \sum_{i=0}^{h-1} b^i + = (b-1) \cdot {b^h-1 \over b-1} + = b^h - 1. +$$ +Therefore in each tree, we have $n \le b^h-1$, so $h \ge \log_b (n+1)$. +\qed + +\corr{The height is $\Omega(\log_b n)$ and $\O(\log_a n)$.} + +\subsection{Searching for a~key} + +$\alg{Find}(x)$ follows the general algorithm for multi-way trees. +It visits $\O(\log_a n) = \O(\log n/\log a)$ nodes. In each node, it compares~$x$ with +all keys of the node, which can be performed in time $\O(\log b)$ by binary search. +In total, we spend time $\Theta(\log n \cdot \log b / \log a)$. + +If $b$ is~polynomial in~$a$, the ratio of logarithms is $\Theta(1)$, so the complexity +of \alg{Find} is $\Theta(\log n)$. This is optimum since each non-final comparison brings +at most 1~bit of information and we need to gather $\log n$ bits to determine the result. + +\subsection{Insertion} + +TODO + +An~\alg{Insert} takes $\Theta(b \cdot \log n / \log a)$ time. + +\subsection{Deletion} + +TODO + +A~\alg{Delete} takes $\Theta(b \cdot \log n / \log a)$ time. + +\subsection{The choice of parameters} + +For a~fixed~$a$, the complexity of all three operations increases with~$b$, +so we should set $b$~small. The usual choices are the minimum value $2a-1$ +allowed by the definition or $2a$. As we will see in the following sections, +the latter value has several advantages. + +If $b\in\Theta(a)$, the complexity of \alg{Find} becomes $\Theta(\log n)$. +Both \alg{Insert} and \alg{Delete} will run in time $\Theta(\log n \cdot (a/\log a))$. 
+We can therefore conclude that we want to set $a$ to~2 or possibly to another small +constant. The best choices are the $(2,3)$-tree and the $(2,4)$-tree. + +This is true on the RAM, or on any machine with ideal random-access memory. +If we store the tree on a~disk, the situation changes. A~disk is divided into blocks +(the typical size of a~block is on the order of kilobytes) and I/O is performed on +whole blocks. Reading a~single byte is therefore as expensive as reading the full +block. In this situation, we will set~$a$ to make the size of a~node match the size +of the block. Consider an example with $4\,{\rm KB}$ blocks, 32-bit keys, and 32-bit pointers. +If we use the $(256,511)$-tree, one node will fit in a~block and the tree will be +very shallow: four levels suffice for storing more than 33~million keys. Furthermore, +the last level contains only external nodes and we can keep the root cached in memory, +so each search will read only 2~blocks. + +The same principle actually applies to main memory of contemporary computers, too. +They usually employ a~fast \em{cache} between the processor and the (much slower) +main memory. The communication between the cache and the memory involves transfer +of \em{cache lines} --- blocks of typical size around $64\,{\rm B}$. It therefore +helps if we match the size of nodes with the size of cache lines and we align the start +of nodes to a~multiple of cache line size. For 32-bit keys and 32-bit pointers, +we can use a~$(4,7)$-tree. + +\subsection{Other versions} + +Our definition of $(a,b)$-trees is not the only one used. Some authors prefer +to store useful keys only on the last level and let the other internal nodes +contain copies of these keys (typically minima of subtrees) for easy +navigation. This requires minor modifications to our procedures, but the +asymptotics stay the same. This version can be useful in databases, which usually +associate potentially large data with each key. 
They have two different formats +of nodes, possibly with different choices of $a$ and~$b$: leaves, which contain +keys with associated data, and internal nodes with keys and pointers. + +Database theorists often prefer the name \em{B-trees.} There are many definitions +of B-trees in the wild, but they are usually equivalent to $(a,2a-1)$-trees or +$(a,2a)$-trees, possibly modified as in the previous paragraph. + +\section{Amortized analysis} + +TODO + +\theorem{A~sequence of $m$~\alg{Insert}s on an~initially empty $(a,b)$-tree +performs $\O(m)$ node modifications.} + +\theorem{A~sequence of $m$~\alg{Insert}s and \alg{Delete}s on an~initially empty $(a,2a)$-tree +performs $\O(m)$ node modifications.} + +\subsection{A-sort} + +TODO + +\section{Top-down (a,b)-trees and parallel access} + +\section{Red-black trees} + +TODO + +\endchapter