Commit 67e39491 authored by Martin Mareš

(a,b)-trees: init

parent 951f4821
TOP=..
PICS=ab-example
include ../Makerules
import ads;
import trees;
/* First tree */
pair u[];
real s = 1;
u[0] = (0, 0);
u[1] = u[0] + (-1.2, -s);
u[2] = u[0] + (0, -s);
u[3] = u[0] + (1.2, -s);
u[4] = u[1] + (-0.4, -s);
u[5] = u[1] + (0, -s);
u[6] = u[1] + (0.4, -s);
u[7] = u[2] + (-0.3, -s);
u[8] = u[2] + (0.3, -s);
u[9] = u[3] + (-0.3, -s);
u[10] = u[3] + (0.3, -s);
tree_init(u);
real d = 0.1;
real dd = 0.18;
ab_edge(0, 1, -dd);
ab_edge(0, 2);
ab_edge(0, 3, dd);
ab_edge(1, 4, -dd);
ab_edge(1, 5);
ab_edge(1, 6, dd);
ab_edge(2, 7, -d);
ab_edge(2, 8, d);
ab_edge(3, 9, -d);
ab_edge(3, 10, d);
tree_elliptic_node(0, "4\;7");
tree_elliptic_node(1, "1\;3");
tree_elliptic_node(2, "6");
tree_elliptic_node(3, "9");
for (int i=4; i<=10; ++i)
	tree_ext(i);
/* Second tree */
pair v[];
real s = 1;
v[0] = (4.5, 0);
v[1] = v[0] + (-1.2, -s);
v[2] = v[0] + (0, -s);
v[3] = v[0] + (1.2, -s);
v[4] = v[1] + (-0.3, -s);
v[5] = v[1] + (0.3, -s);
v[6] = v[2] + (-0.3, -s);
v[7] = v[2] + (0.3, -s);
v[8] = v[3] + (-0.4, -s);
v[9] = v[3] + (0, -s);
v[10] = v[3] + (0.4, -s);
tree_init(v);
real d = 0.1;
real dd = 0.18;
ab_edge(0, 1, -dd);
ab_edge(0, 2);
ab_edge(0, 3, dd);
ab_edge(1, 4, -d);
ab_edge(1, 5, d);
ab_edge(2, 6, -d);
ab_edge(2, 7, d);
ab_edge(3, 8, -dd);
ab_edge(3, 9);
ab_edge(3, 10, dd);
tree_elliptic_node(0, "3\;6");
tree_elliptic_node(1, "1");
tree_elliptic_node(2, "4");
tree_elliptic_node(3, "7\;9");
for (int i=4; i<=10; ++i)
	tree_ext(i);
\ifx\chapter\undefined
\input adsmac.tex
\singlechapter{3}
\fi
\chapter[abtree]{(a,b)-trees}
In this chapter, we will study an extension of binary search trees.
It will store multiple keys per node, so the nodes will have more than
two children. The structure of such trees will be more complex, but we
will gain much more straightforward balancing operations.
\section{Worst-case bounds}
\defn{
A~\df{multi-way search tree} is a~rooted tree with specified order
of children in every node. Nodes are divided into internal and external.
Each~\em{internal node} contains one or more distinct keys, stored in
increasing order. A~node with keys $x_1 < \ldots < x_k$ has $k+1$ children
$s_0,\ldots,s_k$. The keys in the node separate keys in the corresponding
subtrees. More formally, if we extend the keys by sentinels $x_0=-\infty$ and
$x_{k+1}=+\infty$, every key~$y$ in the subtree $T(s_i)$ satisfies $x_i < y < x_{i+1}$.
\em{External nodes} carry no data and they have no children. These are the
leaves of the tree. In a~program, we can represent them by null pointers.
}
\obs{
Searching in multi-way trees is similar to binary trees. We start at the root.
In each node, we compare the desired key with all keys of the node. Then we either
finish or continue in a~uniquely determined subtree. When we reach an external node,
we conclude that the requested key is not present.
The universe is split into open intervals by the keys of the tree. There is a~1-to-1
correspondence between these intervals and external nodes of the tree. This means
that searches for all keys from the same interval end in the same external node.
}
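The search described in the observation can be sketched as follows. This is only an illustration under our own assumptions: a node is a hypothetical pair \t{(keys, children)}, external nodes are \t{None}, and \t{leaf} is a helper introduced here for building nodes whose children are all external.

```python
def find(root, x):
    """Return True if x is stored in the tree, False if the search ends in an external node."""
    node = root
    while node is not None:
        keys, children = node
        # Count the keys smaller than x; the result is the index i of the
        # only subtree T(s_i) that may contain x.
        i = 0
        while i < len(keys) and keys[i] < x:
            i += 1
        if i < len(keys) and keys[i] == x:
            return True          # x is one of the keys of this node
        node = children[i]       # continue in the uniquely determined subtree
    return False                 # reached an external node


def leaf(*keys):
    """Hypothetical helper: an internal node whose children are all external."""
    return (list(keys), [None] * (len(keys) + 1))


# A tree with root keys 4 and 7 and three children holding 1,3 / 6 / 9.
tree = ([4, 7], [leaf(1, 3), leaf(6), leaf(9)])
print(find(tree, 6), find(tree, 5))
```

Searching for 5 and for any other value strictly between 4 and 6 ends in the same external node, which is the correspondence between intervals and external nodes mentioned above.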
As in binary search trees, multi-way trees can become degenerate. We therefore need
to add further invariants to keep the trees balanced.
\defn{
An~\em{(a,b)-tree} for parameters $a\ge 2$ and $b\ge 2a-1$ is a~multi-way
search tree, which satisfies:
\tightlist{n.}
\:The root has between 2 and~$b$ children (unless it is a~leaf, which happens
when the set of keys is empty). Every other internal node
has between $a$ and~$b$ children.
\:All external nodes have the same depth.
\endlist
}
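To make the two axioms concrete, here is a small checker. It is a sketch, not part of the text's procedures: it assumes nodes are pairs \t{(keys, children)} with \t{None} for external nodes, and the names \t{is_ab_tree} and \t{leaf} are ours.

```python
def is_ab_tree(root, a, b):
    """Check the key ordering and both axioms of an (a,b)-tree."""
    if a < 2 or b < 2 * a - 1:
        raise ValueError("need a >= 2 and b >= 2a - 1")
    external_depths = set()

    def check(node, depth, lo, hi, is_root):
        if node is None:                      # external node
            external_depths.add(depth)
            return True
        keys, children = node
        if len(children) != len(keys) + 1:
            return False
        # Axiom 1: the root has 2..b children, other internal nodes a..b.
        min_children = 2 if is_root else a
        if not (min_children <= len(children) <= b):
            return False
        # Keys are strictly increasing and lie strictly between the separators.
        bounds = [lo] + list(keys) + [hi]
        if any(bounds[i] >= bounds[i + 1] for i in range(len(bounds) - 1)):
            return False
        return all(check(c, depth + 1, bounds[i], bounds[i + 1], False)
                   for i, c in enumerate(children))

    ok = check(root, 0, float("-inf"), float("inf"), True)
    # Axiom 2: all external nodes have the same depth.
    return ok and len(external_depths) == 1


def leaf(*keys):
    """Hypothetical helper: an internal node whose children are all external."""
    return (list(keys), [None] * (len(keys) + 1))


first = ([4, 7], [leaf(1, 3), leaf(6), leaf(9)])
second = ([3, 6], [leaf(1), leaf(4), leaf(7, 9)])
print(is_ab_tree(first, 2, 3), is_ab_tree(second, 2, 3))
```

Both example trees store the keys 1, 3, 4, 6, 7, 9 and pass the check for $a=2$, $b=3$. The empty tree (a root that is itself external) passes as well.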
The requirements on $a$ and~$b$ may seem mysterious, but they will become
clear when we define operations.
The smallest $(a,b)$-trees allowed by the definition are $(2,3)$-trees; see
figure~\figref{abexample} for examples. Internal nodes are drawn round,
external nodes are square.
\figure[abexample]{ab-example.pdf}{}{Two $(2,3)$-trees for the same set of keys}
Unlike in other texts, we will \em{not} assume that $a$ and~$b$ are constants,
which can be hidden in the~$\O$. This will help us to examine how the performance
of the structure depends on these parameters.
We will start with bounding the height of $(a,b)$-trees. Height will be measured
in edges: a~tree of height~$h$ has the root at depth~0 and external nodes at depth~$h$.
We will call the set of nodes at depth~$i$ the $i$-th \em{level.}
\lemma{
The height of an~$(a,b)$-tree with $n$~keys lies between
$\log_b(n+1)$ and $1 + \log_a((n+1)/2)$.
}
\proof
Let us start with the upper bound. We will calculate the minimum number of keys
in a~tree of height~$h$. The minimum will be attained by a~tree where each node
contains the minimum possible number of keys, so it has the minimum possible number
of children. Level~0 contains only the root with 1~key. Level~$h$ contains only
external nodes. The $i$-th level in between (for $0<i<h$) contains $2a^{i-1}$ nodes with $a-1$ keys
each. Summing over levels, we get:
$$
1 + \sum_{i=1}^{h-1} 2a^{i-1}(a-1)
= 1 + 2(a-1) \sum_{i=0}^{h-2} a^i
= 1 + 2(a-1) {(a^{h-1}-1)\over (a-1)}
= 2a^{h-1} - 1.
$$
So $n \ge 2a^{h-1} - 1$ in any tree of height~$h$. Solving this for~$h$ yields
$h\le 1 + \log_a((n+1)/2)$.
For the lower bound, we consider the maximum number of keys for height~$h$.
All nodes will contain the highest possible number of keys. Thus we will have
$b^i$~nodes at level~$i$, each with $b-1$ keys. In total, the number of keys will reach
$$
\sum_{i=0}^{h-1} b^i(b-1)
= (b-1) \sum_{i=0}^{h-1} b^i
= (b-1) \cdot {b^h-1 \over b-1}
= b^h - 1.
$$
Therefore in each tree, we have $n \le b^h-1$, so $h \ge \log_b (n+1)$.
\qed
\corr{The height is $\Omega(\log_b n)$ and $\O(\log_a n)$.}
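The two sums from the proof are easy to check numerically. The following sketch evaluates them level by level and compares the results with the closed forms; the function names are ours and the parameter pairs are arbitrary examples.

```python
def min_keys(a, h):
    """Fewest keys in a tree of height h >= 1: 1 key in the root, a-1 keys in every other node."""
    return 1 + sum(2 * a ** (i - 1) * (a - 1) for i in range(1, h))


def max_keys(b, h):
    """Most keys in a tree of height h: b**i nodes on level i, each with b-1 keys."""
    return sum(b ** i * (b - 1) for i in range(h))


# Compare the level-by-level sums with the closed forms from the proof.
for a, b in [(2, 3), (2, 4), (4, 7), (256, 511)]:
    for h in range(1, 6):
        assert min_keys(a, h) == 2 * a ** (h - 1) - 1
        assert max_keys(b, h) == b ** h - 1

# Range of key counts possible at each height of a (2,3)-tree.
for h in range(1, 5):
    print(h, min_keys(2, h), max_keys(3, h))
```

For instance, a $(2,3)$-tree of height~2 holds between 3 and 8 keys, which agrees with the 6~keys stored in each tree of the example figure.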
\subsection{Searching for a~key}
$\alg{Find}(x)$ follows the general algorithm for multi-way trees.
It visits $\O(\log_a n) = \O(\log n/\log a)$ nodes. In each node, it compares~$x$ with
all keys of the node, which can be performed in time $\O(\log b)$ by binary search.
In total, we spend time $\Theta(\log n \cdot \log b / \log a)$.
If $b$ is~polynomial in~$a$, the ratio of logarithms is $\Theta(1)$, so the complexity
of \alg{Find} is $\Theta(\log n)$. This is optimal since each non-final comparison brings
at most 1~bit of information and we need to gather $\log n$ bits to determine the result.
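A possible rendering of \alg{Find} with binary search inside each node is sketched below. It assumes a representation of our own choosing (a node is a pair \t{(keys, children)}, external nodes are \t{None}) and uses \t{bisect\_left} from Python's standard library for the per-node search. It also counts the visited internal nodes, so the height bound can be observed directly.

```python
from bisect import bisect_left


def find(root, x):
    """Return (found, visited): whether x is present and how many internal nodes were inspected."""
    node, visited = root, 0
    while node is not None:
        keys, children = node
        visited += 1
        i = bisect_left(keys, x)     # binary search among at most b-1 keys
        if i < len(keys) and keys[i] == x:
            return True, visited
        node = children[i]           # x can only be in the i-th subtree
    return False, visited


def leaf(*keys):
    """Hypothetical helper: an internal node whose children are all external."""
    return (list(keys), [None] * (len(keys) + 1))


tree = ([3, 6], [leaf(1), leaf(4), leaf(7, 9)])
print(find(tree, 9), find(tree, 5), find(tree, 3))
```

In this tree of height~2, no search inspects more than 2 internal nodes; a key stored in the root is found after inspecting a single node.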
\subsection{Insertion}
TODO
An~\alg{Insert} takes $\Theta(b \cdot \log n / \log a)$ time.
\subsection{Deletion}
TODO
A~\alg{Delete} takes $\Theta(b \cdot \log n / \log a)$ time.
\subsection{The choice of parameters}
For a~fixed~$a$, the complexity of all three operations increases with~$b$,
so we should keep $b$~small. The usual choices are the minimum value $2a-1$
allowed by the definition or $2a$. As we will see in the following sections,
the latter value has several advantages.
If $b\in\Theta(a)$, the complexity of \alg{Find} becomes $\Theta(\log n)$.
Both \alg{Insert} and \alg{Delete} will run in time $\Theta(\log n \cdot (a/\log a))$.
We can therefore conclude that we want to set $a$ to~2 or possibly to another small
constant. The best choices are the $(2,3)$-tree and the $(2,4)$-tree.
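To get a feeling for the factor $a/\log a$ in the bound on \alg{Insert} and \alg{Delete}, we can tabulate it for a~few values of~$a$. This is a rough illustration only: it takes logarithms base~2, ignores the constants hidden in the~$\Theta$, and the name \t{update\_factor} is ours.

```python
import math


def update_factor(a):
    """The factor a / log a from the bound on Insert and Delete (log base 2, constants ignored)."""
    return a / math.log2(a)


for a in [2, 4, 16, 256]:
    print(a, update_factor(a))
```

The factor is 2 for $a=2$ and $a=4$, but it already reaches 32 for $a=256$, so on a~machine with uniform memory access large nodes only slow updates down.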
This is true on the RAM, or on any machine with ideal random-access memory.
If we store the tree on a~disk, the situation changes. A~disk is divided into blocks
(the typical size of a~block is on the order of kilobytes) and I/O is performed on
whole blocks. Reading a~single byte is therefore as expensive as reading the full
block. In this situation, we will set~$a$ to make the size of a~node match the size
of the block. Consider an example with $4\,{\rm KB}$ blocks, 32-bit keys, and 32-bit pointers.
If we use the $(256,511)$-tree, one node will fit in a~block and the tree will be
very shallow: four levels suffice for storing more than 33~million keys. Furthermore,
the last level contains only external nodes and we can keep the root cached in memory,
so each search will read only 2~blocks.
The same principle actually applies to main memory of contemporary computers, too.
They usually employ a~fast \em{cache} between the processor and the (much slower)
main memory. The communication between the cache and the memory involves transfer
of \em{cache lines} --- blocks of typical size around $64\,{\rm B}$. It therefore
helps if we match the size of nodes with the size of cache lines and we align the start
of nodes to a~multiple of cache line size. For 32-bit keys and 32-bit pointers,
we can use a~$(4,7)$-tree.
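The arithmetic behind both examples can be reproduced by a short calculation. The sketch assumes a~node layout of our own choosing: $b-1$ keys followed by $b$~child pointers, with no header and no padding. The function names are ours as well.

```python
def node_bytes(b, key_bytes=4, ptr_bytes=4):
    """Bytes needed by a full node: b-1 keys plus b child pointers (no header assumed)."""
    return (b - 1) * key_bytes + b * ptr_bytes


def min_keys(a, h):
    """Fewest keys in an (a,b)-tree of height h, taken from the height lemma."""
    return 2 * a ** (h - 1) - 1


# Disk example: a full node of a (256,511)-tree fits in a 4 KB block.
assert node_bytes(511) == 4084 <= 4096

# A tree of height 4 needs at least this many keys,
# so any smaller set of keys is stored in height at most 3.
print(min_keys(256, 4))

# Cache example: a full node of a (4,7)-tree fits in a 64 B cache line.
assert node_bytes(7) == 52 <= 64
```

The printed value is $2\cdot 256^3 - 1 = 33\,554\,431$. Any smaller set of keys therefore has height at most~3, that is, four levels of which the last one is external.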
\subsection{Other versions}
Our definition of $(a,b)$-trees is not the only one used. Some authors prefer
to store useful keys only on the last level and let the other internal nodes
contain copies of these keys (typically minima of subtrees) for easy
navigation. This requires minor modifications to our procedures, but the
asymptotics stay the same. This version can be useful in databases, which usually
associate potentially large data with each key. They have two different formats
of nodes, possibly with different choices of $a$ and~$b$: leaves, which contain
keys with associated data, and internal nodes with keys and pointers.
Database theorists often prefer the name \em{B-trees.} There are many definitions
of B-trees in the wild, but they are usually equivalent to $(a,2a-1)$-trees or
$(a,2a)$-trees, possibly modified as in the previous paragraph.
\section{Amortized analysis}
TODO
\theorem{A~sequence of $m$~\alg{Insert}s on an~initially empty $(a,b)$-tree
performs $\O(m)$ node modifications.}
\theorem{A~sequence of $m$~\alg{Insert}s and \alg{Delete}s on an~initially empty $(a,2a)$-tree
performs $\O(m)$ node modifications.}
\subsection{A-sort}
TODO
\section{Top-down (a,b)-trees and parallel access}
\section{Red-black trees}
TODO
\endchapter