Commit 974efeab authored by Martin Mareš

Merge branch 'master' of gitlab.kam.mff.cuni.cz:mj/dsbook

parents b63952d4 e901e8a6
@@ -214,9 +214,7 @@ of the block size~$B$. If it is so, we can align the start of the matrix to the
beginning of a~block, so the start of each row will be also aligned. If we set $d=B$,
every tile will be also aligned and each row of the tile will be a~complete block.
If we have enough cache, we can process a~tile in $\O(B)$ I/O operations. As we have
-$\O(N^2/B + 1)$ tiles, the total I/O complexity is $\O(N^2/B + B)$. As usually, this
-can be improved to $\O(N^2/B + 1)$ if we realize that the additional term is required only
-in cases where the whole matrix is smaller than a~single block.
+$N^2/B^2$ tiles, the total I/O complexity is $\O(N^2/B)$.
For this algorithm to work, the cache must be able to hold two tiles at once. Since each tile
contains $B^2$ items, this means $M \ge 2B^2$. An~inequality of this kind is usually
@@ -229,10 +227,14 @@ in the cache and the I/O complexity of our algorithm will not change asymptotica
Now, what if $N$ is not divisible by~$B$? We lose all alignment, but we will prove
that the algorithm still works. Consider a~$B\times B$ tile. In the worst case, each row
-spans 2~blocks. So we need $2B$ I/O operations to read it to cache, which is still $\O(B)$.
+spans 2~blocks. So we need $2B$ I/O operations to read it into cache, which is still $\O(B)$.
The cache must contain at least $4B^2$ items, but this is still within limits of our tall-cache
assumption.
+To process all $\O(N^2/B^2+1)$ tiles, we need $\O(N^2/B + B)$ operations. As usually, this
+can be improved to $\O(N^2/B + 1)$ if we realize that the additional term is required only
+in cases where the whole matrix is smaller than a~single block.
We can conclude that in the cache-aware model, we can transpose a~$N\times N$ matrix
in time $\Theta(N^2)$ with $\O(N^2/B + 1)$ block transfers. This is obviously optimal.
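Note on the hunk above: the tile-by-tile transposition it describes can be sketched as follows. This is only an illustrative sketch, not code from the book. The names `transpose_tiled`, `a`, `n` and `d` are hypothetical, the matrix is assumed to be stored row-major in a flat list, and the in-place swapping of tile pairs is an assumption suggested by the requirement that two tiles fit in the cache. Python does not model the cache, so the sketch shows only the order of accesses, with `d` playing the role of the tile size (the text sets $d=B$).

```python
def transpose_tiled(a, n, d):
    """Transpose an n-by-n matrix in place, one d-by-d tile pair at a time.

    `a` is a flat list of length n*n holding the matrix in row-major order.
    `n` does not have to be divisible by `d`; border tiles are simply smaller.
    """
    for ti in range(0, n, d):            # first row of the current tile
        for tj in range(ti, n, d):       # first column; tiles on or above the diagonal
            for i in range(ti, min(ti + d, n)):
                # Inside a diagonal tile, visit only entries above the main
                # diagonal so that every pair (i, j), (j, i) is swapped once.
                j_start = i + 1 if ti == tj else tj
                for j in range(j_start, min(tj + d, n)):
                    a[i * n + j], a[j * n + i] = a[j * n + i], a[i * n + j]
    return a
```

Each off-diagonal tile is exchanged with its mirror image across the diagonal, so the inner two loops touch exactly two tiles at a time.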
@@ -274,7 +276,7 @@ whole algorithm finishes in $\O(N^2)$ steps.
To analyze I/O complexity, we focus on the highest level, at which the sub-problems correspond
to tiles from the previous algorithm. Specifically, we will find the smallest~$i$ such that
-the sub-problem size $d = N^2/i$ is at most~$B$. Unless the whole input is small and $i=0$,
+the sub-problem size $d = N/2^i$ is at most~$B$. Unless the whole input is small and $i=0$,
this implies $2d = N/2^{i-1} > B$. Therefore $B/2 < d \le B$.
To establish an upper bound on the optimal number of block transfers, we show a~concrete
......
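Note on the last hunk: the corrected formula $d = N/2^i$ can be checked numerically with a small sketch. The function name `top_level` and its parameters are hypothetical and not part of the book. Exact rationals are used so that the halving stays exact even when $N$ is not a power of two, which is an assumption made here for illustration only.

```python
from fractions import Fraction


def top_level(n, b):
    """Return (i, d): the smallest i with d = n / 2**i at most b, and that d."""
    i = 0
    d = Fraction(n)
    while d > b:       # keep halving until the sub-problem size fits
        d /= 2
        i += 1
    return i, d


# Whenever i > 0, the previous level was still too large, so B/2 < d <= B.
for n, b in [(1024, 48), (100, 16), (8, 64)]:
    i, d = top_level(n, b)
    assert d <= b
    assert i == 0 or Fraction(b, 2) < d
```

The last pair, with the whole input smaller than $B$, is the $i=0$ case that the text excludes from the lower bound on $d$.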