    \ifx\chapter\undefined
    \input adsmac.tex
    \singlechapter{3}
    \fi
    
    \chapter[abtree]{(a,b)-trees}
    
In this chapter, we will study an extension of binary search trees.
It stores multiple keys per node, so the nodes have more than
two children. The structure of such trees is more complex, but in
exchange, the balancing operations become much more straightforward.
    
    \section{Definition and operations}
    
    \defn{
A~\df{multi-way search tree} is a~rooted tree with a~specified order
of children in every node. Nodes are divided into internal and external.
    
    Each~\em{internal node} contains one or more distinct keys, stored in
    increasing order. A~node with keys $x_1 < \ldots < x_k$ has $k+1$ children
    $s_0,\ldots,s_k$. The keys in the node separate keys in the corresponding
    subtrees. More formally, if we extend the keys by sentinels $x_0=-\infty$ and
    $x_{k+1}=+\infty$, every key~$y$ in the subtree $T(s_i)$ satisfies $x_i < y < x_{i+1}$.
    
    \em{External nodes} carry no data and they have no children. These are the
    leaves of the tree. In a~program, we can represent them by null pointers.
    }
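
For example, an internal node with keys $x_1 = 10$ and $x_2 = 20$ has three children:
$T(s_0)$ contains the keys less than~10, $T(s_1)$ the keys between 10 and~20, and
$T(s_2)$ the keys greater than~20.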
    
    \obs{
Searching in a~multi-way tree is similar to searching in a~binary tree. We start at the root.
    In each node, we compare the desired key with all keys of the node. Then we either
    finish or continue in a~uniquely determined subtree. When we reach an external node,
    we conclude that the requested key is not present.
    
The universe is split into open intervals by the keys of the tree. There is a~1-to-1
correspondence between these intervals and external nodes of the tree. This means
that searches for all keys from the same interval end in the same external node.
    }
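
For example, a~tree with keys 1, 2, and~3 splits the universe into the four intervals
$(-\infty,1)$, $(1,2)$, $(2,3)$, and $(3,+\infty)$, so whatever its shape, it has
exactly four external nodes.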
    
As in binary search trees, multi-way trees can become degenerate. We therefore need
to add further invariants to keep the trees balanced.
    
    \defn{
    An~\em{(a,b)-tree} for parameters $a\ge 2$ and $b\ge 2a-1$ is a~multi-way
    search tree, which satisfies:
    \tightlist{n.}
    \:The root has between 2 and~$b$ children (unless it is a~leaf, which happens
      when the set of keys is empty). Every other internal node
      has between $a$ and~$b$ children.
    \:All external nodes have the same depth.
    \endlist
    }
    
    The requirements on $a$ and~$b$ may seem mysterious, but they will become
    clear when we define operations.
The minimum workable choice of parameters gives $(2,3)$-trees --- see
figure~\figref{abexample} for examples. Internal nodes are drawn round,
external nodes square.
    
    \figure[abexample]{ab-example.pdf}{}{Two $(2,3)$-trees for the same set of keys}
    
    Unlike in other texts, we will \em{not} assume that $a$ and~$b$ are constants,
    which can be hidden in the~$\O$. This will help us to examine how the performance
    of the structure depends on these parameters.
    
We will start by bounding the height of $(a,b)$-trees. Height will be measured
in edges: a~tree of height~$h$ has the root at depth~0 and external nodes at depth~$h$.
We will call the set of nodes at depth~$i$ the $i$-th \em{level.}
    
    \lemma{
    The height of an~$(a,b)$-tree with $n$~keys lies between
    $\log_b(n+1)$ and $1 + \log_a((n+1)/2)$.
    }
    
    \proof
    Let us start with the upper bound. We will calculate the minimum number of keys
    in a~tree of height~$h$. The minimum will be attained by a~tree where each node
    contains the minimum possible number of keys, so it has the minimum possible number
    of children. Level~0 contains only the root with 1~key. Level~$h$ contains only
external nodes. Each level~$i$ in between contains $2a^{i-1}$ nodes with $a-1$ keys
    each. Summing over levels, we get:
    $$
    	    1 + \sum_{i=1}^{h-1} 2a^{i-1}(a-1)
    	  = 1 + 2(a-1) \sum_{i=0}^{h-2} a^i
    	  = 1 + 2(a-1) {(a^{h-1}-1)\over (a-1)}
    	  = 2a^{h-1} - 1.
    $$
    So $n \ge 2a^{h-1} - 1$ in any tree of height~$h$. Solving this for~$h$ yields
    $h\le 1 + \log_a((n+1)/2)$.
    
    For the lower bound, we consider the maximum number of keys for height~$h$.
    All nodes will contain the highest possible number of keys. Thus we will have
    $b^i$~nodes at level~$i$, each with $b-1$ keys. In total, the number of keys will reach
    $$
    	  \sum_{i=0}^{h-1} b^i(b-1)
    	= (b-1) \sum_{i=0}^{h-1} b^i
    	= (b-1) \cdot {b^h-1 \over b-1}
    	= b^h - 1.
    $$
    Therefore in each tree, we have $n \le b^h-1$, so $h \ge \log_b (n+1)$.
    \qed
    
    \cor{The height is $\Omega(\log_b n)$ and $\O(\log_a n)$.}
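
For a~concrete illustration, consider a~$(2,3)$-tree with $n=1000$ keys. The lemma
bounds its height between
$$
	\log_3 1001 \doteq 6.29
	\qquad\hbox{and}\qquad
	1 + \log_2 {1001\over 2} \doteq 9.97,
$$
so the height is 7, 8, or~9.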
    
    \subsection{Searching for a~key}
    
    $\alg{Find}(x)$ follows the general algorithm for multi-way trees.
It visits $\O(\log_a n)$ nodes, which is $\O(\log n/\log a)$. In each node, it locates~$x$
among the keys of the node, which takes $\O(\log b)$ time by binary search.
    In total, we spend time $\Theta(\log n \cdot \log b / \log a)$.
    
    If $b$ is~polynomial in~$a$, the ratio of logarithms is $\Theta(1)$, so the complexity
    of \alg{Find} is $\Theta(\log n)$. This is optimum since each non-final comparison brings
    at most 1~bit of information and we need to gather $\log n$ bits to determine the result.
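
For example, for $b=2a$ the ratio of logarithms is
$$
	{\log 2a \over \log a} = 1 + {\log 2 \over \log a} \le 2
$$
for every $a\ge 2$, and more generally $b=a^c$ gives a~ratio of exactly~$c$.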
    
    \subsection{Insertion}
    
    If we want to insert a~key, we try to find it first. If the key is not present
    yet, the search ends in a~leaf (external node). However, we cannot simply turn this
    leaf into an~internal node with two external children --- this would break the
    axiom that all leaves lie on the same level.
    
    Instead, we insert the key to the parent of the external node --- that is, to a~node
    on the lowest internal level. Adding a~key requires adding a~child, so we add a~leaf.
    This is correct since all other children of that node are also leaves.
    
If the node still has at most $b-1$ keys, we are done. Otherwise, we split
the overfull node in two and distribute the keys approximately equally. In the
parent of the split node, we need to replace one child pointer by two, so we have
to add a~key to the parent. We solve this by moving the middle key of the
overfull node to the parent. Therefore we are splitting the overfull node into three
parts: the middle key is moved to the parent, all smaller keys form one new node,
and all larger keys form the other one. The children are distributed among the new
nodes in the only possible way.
    
    \figure{ab-ins.pdf}{}{Splitting an overfull node on insertion to a~$(2,3)$-tree}
    
    This way, we have reduced insertion of a~key to the current node to insertion
    of a~key to its parent. Again, this can lead to the parent overflowing, so the
    splitting can continue, possibly up to the root. If it happens that we split
    the root, we create a~new root with a~single key and two children (this is correct,
since we allowed fewer than $a$~children in the root). This increases the height
    of the tree by~1.
    
Let us calculate the time complexity of \alg{Insert}. In the worst case, we visit
    $\Theta(1)$ nodes on each level and we spend $\Theta(b)$ time on each node. This
    makes $\Theta(b \cdot \log n / \log a)$ time total.
    
It remains to show that nodes created by splitting are not undersized, meaning that they
have at least~$a$ children. We split a~node~$v$ when it has reached $b+1$ children,
so it has $b$~keys. We send one key to the parent, so the new nodes $v_1$ and~$v_2$
will take $\lfloor (b-1)/2\rfloor$ and $\lceil (b-1)/2\rceil$ keys. If either of them
were undersized, we would have $(b-1)/2 < a-1$ and thus $b < 2a-1$. Voilà, this explains
why we put the condition $b\ge 2a-1$ in the definition.
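
We can check the borderline case: in a~$(2,3)$-tree, the split distributes the $b-1=2$
remaining keys as $1+1$, so both new nodes get exactly $a-1=1$ key. In a~hypothetical
$(3,4)$-tree, which violates the condition, the $3$~remaining keys would be split
as $1+2$ and the smaller node would be undersized.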
    
    \subsection{Deletion}
    
    If we want to delete a~key, we find it first. If it is located on the last internal
    level, we can delete it directly, together with one of the leaves under it.
    We still have to check for underflow, though.
    
    Keys located on the higher levels cannot be removed directly --- the internal node
    would lose one pointer and we would have a~subtree left in our hands with no place
    to connect it to. This situation is similar to deletion of a~node with two children
    in a~binary search tree, so we can solve it similarly: We replace the deleted key
    by its successor (which is the leftmost key in the deleted key's right subtree).
    The successor lies on the last internal level, so it can be deleted directly.
    
    The only remaining problem is fixing an undersized node. For a~moment we will assume
    that the node is not the root, so it has $a-2$ keys. It is tempting to solve the underflow
    by merging the node with one of its siblings. However, this can be done only if
    the sibling contains few keys; otherwise, the merged node could be overfull. But if the
    sibling is large, we can fix our problem by borrowing a~key from it.
    
    Let us be exact. Suppose that we have an undersized node~$v$ with
    $a-2$ keys and this node has a~left sibling~$\ell$ separated by a~key~$p$ in their
    common parent. If there is no left sibling, we use the right sibling and follow
    a~mirror image of the procedure.
    
    If the sibling has only~$a$ children, we merge nodes~$v$ and~$\ell$ to a~single
    node and we also move the key~$p$ from the parent there. This creates a~node with
    $(a-2) + (a-1) + 1 = 2a-2$ keys, which cannot exceed $b-1$. Therefore we reduced
    deletion from~$v$ to deletion from its parent.
    
    \figure{ab-del-merge.pdf}{}{Merging nodes on deletion from a~$(2,3)$-tree}
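
Again, the borderline case is instructive: in a~$(2,3)$-tree, the undersized node has
$a-2=0$ keys and the sibling $a-1=1$ key, so together with the separating key~$p$
the merged node has $2a-2=2=b-1$ keys, the maximum allowed.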
    
In the opposite case, when the sibling has more than~$a$ children, we disconnect its
    rightmost child~$c$ and its largest key~$m$. Then we move the key~$m$ to the parent
    and the key~$p$ from the parent to~$v$. There, $p$~becomes the smallest key, before
    which we connect the child~$c$. After this ``rotation of keys'', all nodes will
    have numbers of children in the allowed range, so we can stop.
    
    \figure{ab-del-borrow.pdf}{}{Borrowing a~key from a~sibling on deletion from a~$(2,3)$-tree}
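
Let us verify the counts after this rotation: the sibling~$\ell$ had more than~$a$ children
and lost one, so it still has at least~$a$. The node~$v$ had $a-1$ children and gained
the child~$c$, so it has exactly~$a$. The parent exchanged the key~$p$ for the key~$m$,
so its number of children did not change.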
    
    In each step of the algorithm, we either produce a~node which is not undersized,
    or we fix the underflow by borrowing from a~sibling, or we merge two nodes. In the
    first two cases, we stop. In the third case, we continue by deleting a~key on the
    next higher level, continuing possibly to the root. If the root becomes
undersized, it has no keys left, in which case we make its only child the new
root. This decreases the height of the tree by~1.
    
    In the worst case, \alg{Delete} visits $\Theta(1)$ nodes on each level and it
    spends $\Theta(b)$ time per node. Its time complexity is therefore $\Theta(b
    \cdot \log n / \log a)$.
    
    \subsection{The choice of parameters}
    
    For a~fixed~$a$, the complexity of all three operations increases with~$b$,
    so we should set $b$~small. The usual choices are the minimum value $2a-1$
    allowed by the definition or $2a$. As we will see in the following sections,
    the latter value has several advantages.
    
    If $b\in\Theta(a)$, the complexity of \alg{Find} becomes $\Theta(\log n)$.
    Both \alg{Insert} and \alg{Delete} will run in time $\Theta(\log n \cdot (a/\log a))$.
    We can therefore conclude that we want to set $a$ to~2 or possibly to another small
    constant. The best choices are the $(2,3)$-tree and the $(2,4)$-tree.
    
    This is true on the RAM, or on any machine with ideal random-access memory.
If we store the tree on a~disk, the situation changes. A~disk is divided into blocks
    (the typical size of a~block is on the order of kilobytes) and I/O is performed on
    whole blocks. Reading a~single byte is therefore as expensive as reading the full
    block. In this situation, we will set~$a$ to make the size of a~node match the size
    of the block. Consider an example with $4\,{\rm KB}$ blocks, 32-bit keys, and 32-bit pointers.
    If we use the $(256,511)$-tree, one node will fit in a~block and the tree will be
    very shallow: four levels suffice for storing more than 33~million keys. Furthermore,
    the last level contains only external nodes and we can keep the root cached in memory,
    so each search will read only 2~blocks.
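
The arithmetic behind this choice: a~node of a~$(256,511)$-tree contains at most 510 keys
and 511 child pointers, which occupy $510\cdot 4 + 511\cdot 4 = 4084$ bytes (ignoring
per-node bookkeeping), so the node fits in one block. By the height lemma, every tree with
$n \le 2\cdot 256^3 - 1 = 33\,554\,431$ keys has height at most $1 + \log_{256}((n+1)/2) = 4$.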
    
    The same principle actually applies to main memory of contemporary computers, too.
    They usually employ a~fast \em{cache} between the processor and the (much slower)
    main memory. The communication between the cache and the memory involves transfer
    of \em{cache lines} --- blocks of typical size around $64\,{\rm B}$. It therefore
    helps if we match the size of nodes with the size of cache lines and we align the start
    of nodes to a~multiple of cache line size. For 32-bit keys and 32-bit pointers,
    we can use a~$(4,7)$-tree.
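
Again the arithmetic: a~node of a~$(4,7)$-tree has at most 6 keys and 7 pointers, which
occupy $6\cdot 4 + 7\cdot 4 = 52$ bytes and fit in a~single cache line. Even a~$(4,8)$-tree
fits in $60$ bytes, so we can afford $b=2a$ with its amortized advantages here, too.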
    
    \subsection{Other versions}
    
    Our definition of $(a,b)$-trees is not the only one used. Some authors prefer
    to store useful keys only on the last level and let the other internal nodes
    contain copies of these keys (typically minima of subtrees) for easy
    navigation. This requires minor modifications to our procedures, but the
    asymptotics stay the same. This version can be useful in databases, which usually
    associate potentially large data with each key. They have two different formats
    of nodes, possibly with different choices of $a$ and~$b$: leaves, which contain
    keys with associated data, and internal nodes with keys and pointers.
    
    Database theorists often prefer the name \em{B-trees.} There are many definitions
    of B-trees in the wild, but they are usually equivalent to $(a,2a-1)$-trees or
    $(a,2a)$-trees, possibly modified as in the previous paragraph.
    
    \section{Amortized analysis}
    
In the amortized sense, $(a,b)$-trees cannot perform better than $\Omega(\log n)$
per operation, because every operation involves a~search, and a~sequence of searches can
cost $\Theta(\log n)$ each. However,
if we already know the location of the key in the tree, the rest of the operation
amortizes well. In particular, the amortized number of modified nodes is constant.
    
    \theorem{A~sequence of $m$~\alg{Insert}s on an~initially empty $(a,b)$-tree
    performs $\O(m)$ node modifications.}
    
    \proof
    Each \alg{Insert} performs a~(possibly empty) sequence of node splits and then
    it modifies one node. A~split modifies two nodes: one existing and one newly created.
    Since each split creates a~new node and the number of nodes never decreases, the
    total number of splits is bounded by the number of nodes at the end of the sequence,
    which cannot exceed~$m$.
    
    So we have $\O(m)$ splits, each modifying $\O(1)$ nodes, and $\O(m)$ modifications
    outside splits.
    \qed
    
    When we start mixing insertions with deletions, the situation becomes more complicated.
In an~$(a,2a-1)$-tree, it might happen that $\alg{Insert}(x)$ forces splitting up to the root
    and the immediately following $\alg{Delete}(x)$ undoes all the work and reverts the tree
    back to the original state. So we can construct an~arbitrarily long sequence of operations
    which have $\Omega(\log n)$ cost each.
    
    We already encountered this problem with the flexible arrays of section \secref{amortsec} ---
    if the thresholds for stretching and shrinking are too close, the structure tends to oscillate
    expensively between two states. We cured this by allowing the structure extra ``breathing space''.
    
The same applies to $(a,b)$-trees. Increasing~$b$ to at least~$2a$ is sufficient
to avoid oscillations and to keep the amortized number of changes constant. For simplicity,
we will prove this for the specific case $b=2a$.
    
    \theorem{A~sequence of $m$~\alg{Insert}s and \alg{Delete}s on an~initially empty $(a,2a)$-tree
    performs $\O(m)$ node modifications.}
    
    \proof
    We define the cost of an~operation as the number of nodes it modifies.
    We will show that there exists a~potential~$\Phi$ such that the amortized cost
    of splitting and merging with respect to~$\Phi$ is zero or negative and the amortized
    cost of the rest of \alg{Insert} and \alg{Delete} is constant.
    
    The potential will be a~sum of node contributions. Every node with $k$~keys
    contributes $f(k)$ to the potential, where $f$ will be a~certain non-negative
function. We allow $k$ to range from $a-2$ to~$b=2a$, so that we can also handle nodes
which are temporarily overfull or undersized. By definition, $\Phi$~is initially zero and always
    non-negative.
    
    The function~$f$ should satisfy the following requirements:
    \list{n.}
    \:$\vert f(i) - f(i+1) \vert \le c$ for some constant~$c$.
    
      This means that if the number of keys in the node changes by~1, the node's
      contribution changes at most by a~constant.
    
    \:$f(2a) \ge f(a) + f(a-1) + c + 1$.
    
      This guarantees free splits: If we split a~node with $2a$~keys, we release
      $f(2a)$ units of potential. We have to spend $f(a)$ and $f(a-1)$ units on the
      two new nodes and at most~$c$ on inserting a~key to the parent node. Still,
      we have one unit to pay for the real cost. (Strictly speaking, the real cost is~3,
      but we can make it~1 by scaling the costs and the potential.)
    
    \:$f(a-2) + f(a-1) \ge f(2a-2) + c + 1$.
    
      This guarantees free merges: We are always merging an underflowing node ($a-2$ keys)
      with a~minimum allowed node ($a-1$ keys). The potential released by removing these nodes
      is spent on creation of the merged node with $2a-2$ keys, removing a~key from the parent
      node and paying for the real cost of the operation.
    \endlist
    
With a~little experimentation, we can construct a~function~$f$ meeting these criteria with $c=2$:
    $$\vbox{\offinterlineskip\halign{
    	&\hbox to 4em{\vrule width 0pt height 12pt depth 4pt\hfil$#$\hfil}\vrule\cr
    	k    & a-2 & a-1 & a & \ldots & 2a-2 & 2a-1 & 2a \cr
    	\noalign{\hrule}
    	f(k) & 2   & 1   & 0 & \ldots & 0    & 2    & 4  \cr
    }}$$
    The function is zero between $a$ and $2a-2$.
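
We can verify the requirements directly: neighboring values of~$f$ differ by at most~2;
the split inequality holds with equality, since $f(2a) = 4$ and $f(a) + f(a-1) + c + 1 =
0 + 1 + 2 + 1 = 4$; and so does the merge inequality, since $f(a-2) + f(a-1) = 2 + 1 = 3$
and $f(2a-2) + c + 1 = 0 + 2 + 1 = 3$.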
    
An~\alg{Insert} begins by adding the new key, which has $\O(1)$ real cost and
changes the potential by $\O(1)$. Then it performs a~sequence of splits, each with
    zero amortized cost.
    
    A~\alg{Delete} removes the key, which has $\O(1)$ cost both
    real and amortized. If the node was undersized, it can perform a~sequence of merges
with zero amortized cost. Finally, it can borrow a~key from a~sibling, which has
    $\O(1)$ real and amortized cost, but this happens at most once per \alg{Delete}.
    
    We conclude that the amortized cost of both operations is constant. As the potential
    is non-negative and initially zero, this guarantees that the total real cost of all
    operations is $\O(m)$.
    \qed
    
    % \subsection{A-sort}
    
    \section{Top-down (a,b)-trees and parallel access}
    
    TODO
    
    % \section{Red-black trees}
    
    \endchapter