Merge branch 'master' of gitlab.kam.mff.cuni.cz:mj/dsbook

974efeab · Martin Mareš · b63952d4 · e901e8a6 · 974efeab
Commit 974efeab authored 5 years ago by Martin Mareš
--- a/05-cache/cache.tex
+++ b/05-cache/cache.tex
@@ -214,9 +214,7 @@ of the block size~$B$. If it is so, we can align the start of the matrix to the
 beginning of a~block, so the start of each row will be also aligned. If we set $d=B$,
 every tile will be also aligned and each row of the tile will be a~complete block.
 If we have enough cache, we can process a~tile in $\O(B)$ I/O operations. As we have
-$\O(N^2/B + 1)$ tiles, the total I/O complexity is $\O(N^2/B + B)$. As usually, this
-can be improved to $\O(N^2/B + 1)$ if we realize that the additional term is required only
-in cases where the whole matrix is smaller than a~single block.
+$N^2/B^2$ tiles, the total I/O complexity is $\O(N^2/B)$.

 For this algorithm to work, the cache must be able to hold two tiles at once. Since each tile
 contains $B^2$ items, this means $M \ge 2B^2$. An~inequality of this kind is usually
@@ -229,10 +227,14 @@ in the cache and the I/O complexity of our algorithm will not change asymptotica

 Now, what if $N$ is not divisible by~$B$? We lose all alignment, but we will prove
 that the algorithm still works. Consider a~$B\times B$ tile. In the worst case, each row
-spans 2~blocks. So we need $2B$ I/O operations to read it to cache, which is still $\O(B)$.
+spans 2~blocks. So we need $2B$ I/O operations to read it into cache, which is still $\O(B)$.
 The cache must contain at least $4B^2$ items, but this is still within limits of our tall-cache
 assumption.

+To process all $\O(N^2/B^2+1)$ tiles, we need $\O(N^2/B + B)$ operations. As usual, this
+can be improved to $\O(N^2/B + 1)$ if we realize that the additional term is required only
+in cases where the whole matrix is smaller than a~single block.
+
 We can conclude that in the cache-aware model, we can transpose a~$N\times N$ matrix
 in time $\Theta(N^2)$ with $\O(N^2/B + 1)$ block transfers. This is obviously optimal.

@@ -274,7 +276,7 @@ whole algorithm finishes in $\O(N^2)$ steps.

 To analyze I/O complexity, we focus on the highest level, at which the sub-problems correspond
 to tiles from the previous algorithm. Specifically, we will find the smallest~$i$ such that
-the sub-problem size $d = N^2/i$ is at most~$B$. Unless the whole input is small and $i=0$,
+the sub-problem size $d = N/2^i$ is at most~$B$. Unless the whole input is small and $i=0$,
 this implies $2d = N/2^{i-1} > B$. Therefore $B/2 < d \le B$.

 To establish an upper bound on the optimal number of block transfers, we show a~concrete