Commit 974efeab authored by Martin Mareš

Merge branch 'master' of gitlab.kam.mff.cuni.cz:mj/dsbook

parents b63952d4 e901e8a6
@@ -214,9 +214,7 @@ of the block size~$B$. If it is so, we can align the start of the matrix to the
 beginning of a~block, so the start of each row will be also aligned. If we set $d=B$,
 every tile will be also aligned and each row of the tile will be a~complete block.
 If we have enough cache, we can process a~tile in $\O(B)$ I/O operations. As we have
-$\O(N^2/B + 1)$ tiles, the total I/O complexity is $\O(N^2/B + B)$. As usually, this
-can be improved to $\O(N^2/B + 1)$ if we realize that the additional term is required only
-in cases where the whole matrix is smaller than a~single block.
+$N^2/B^2$ tiles, the total I/O complexity is $\O(N^2/B)$.
 For this algorithm to work, the cache must be able to hold two tiles at once. Since each tile
 contains $B^2$ items, this means $M \ge 2B^2$. An~inequality of this kind is usually
@@ -229,10 +227,14 @@ in the cache and the I/O complexity of our algorithm will not change asymptotica
 Now, what if $N$ is not divisible by~$B$? We lose all alignment, but we will prove
 that the algorithm still works. Consider a~$B\times B$ tile. In the worst case, each row
-spans 2~blocks. So we need $2B$ I/O operations to read it to cache, which is still $\O(B)$.
+spans 2~blocks. So we need $2B$ I/O operations to read it into cache, which is still $\O(B)$.
 The cache must contain at least $4B^2$ items, but this is still within limits of our tall-cache
 assumption.
+To process all $\O(N^2/B^2+1)$ tiles, we need $\O(N^2/B + B)$ operations. As usually, this
+can be improved to $\O(N^2/B + 1)$ if we realize that the additional term is required only
+in cases where the whole matrix is smaller than a~single block.
 We can conclude that in the cache-aware model, we can transpose a~$N\times N$ matrix
 in time $\Theta(N^2)$ with $\O(N^2/B + 1)$ block transfers. This is obviously optimal.
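
A minimal sketch of the tiled transposition analysed in the hunk above, assuming an $N\times N$ matrix stored as a Python list of rows. The function name `transpose_tiled` and the parameter `d` (the tile size, $d=B$ in the text) are illustrative and not taken from the book. The sketch shows only the order in which items are touched: Python lists do not model blocks or the cache, so the I/O bounds themselves are not demonstrated here.

```python
def transpose_tiled(a, d):
    """Transpose the square matrix `a` (a list of rows) in place, tile by tile.

    `d` is the tile side. Tiles at the right and bottom edges are simply
    smaller when len(a) is not divisible by d.
    """
    n = len(a)
    for i0 in range(0, n, d):
        # Only tiles on or above the diagonal are visited; each one is
        # swapped with its mirror image below the diagonal.
        for j0 in range(i0, n, d):
            i1 = min(i0 + d, n)
            j1 = min(j0 + d, n)
            for i in range(i0, i1):
                # Inside a diagonal tile, touch only entries above the
                # diagonal so that no pair is swapped twice.
                j_start = i + 1 if i0 == j0 else j0
                for j in range(j_start, j1):
                    a[i][j], a[j][i] = a[j][i], a[i][j]
    return a
```

While one pair of tiles is processed, only those two tiles are accessed, which is the property the cache-size requirement in the text relies on.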
@@ -274,7 +276,7 @@ whole algorithm finishes in $\O(N^2)$ steps.
 To analyze I/O complexity, we focus on the highest level, at which the sub-problems correspond
 to tiles from the previous algorithm. Specifically, we will find the smallest~$i$ such that
-the sub-problem size $d = N^2/i$ is at most~$B$. Unless the whole input is small and $i=0$,
+the sub-problem size $d = N/2^i$ is at most~$B$. Unless the whole input is small and $i=0$,
 this implies $2d = N/2^{i-1} > B$. Therefore $B/2 < d \le B$.
 To establish an upper bound on the optimal number of block transfers, we show a~concrete