diff --git a/05-cache/cache.tex b/05-cache/cache.tex
index 6a88a6b972fa16bd1f2083b53e54dcec73143d91..a531894555340c3823559b37981338499e5dec33 100644
--- a/05-cache/cache.tex
+++ b/05-cache/cache.tex
@@ -214,9 +214,7 @@ of the block size~$B$. If it is so, we can align the start of the matrix to the
beginning of a~block, so the start of each row will be also aligned. If we set $d=B$,
every tile will be also aligned and each row of the tile will be a~complete block.
If we have enough cache, we can process a~tile in $\O(B)$ I/O operations. As we have
-$\O(N^2/B + 1)$ tiles, the total I/O complexity is $\O(N^2/B + B)$. As usually, this
-can be improved to $\O(N^2/B + 1)$ if we realize that the additional term is required only
-in cases where the whole matrix is smaller than a~single block.
+$N^2/B^2$ tiles, the total I/O complexity is $\O(N^2/B)$.
For this algorithm to work, the cache must be able to hold two tiles at once. Since each tile
contains $B^2$ items, this means $M \ge 2B^2$. An~inequality of this kind is usually
@@ -229,10 +227,14 @@ in the cache and the I/O complexity of our algorithm will not change asymptotica
Now, what if $N$ is not divisible by~$B$? We lose all alignment, but we will prove
that the algorithm still works. Consider a~$B\times B$ tile. In the worst case, each row
-spans 2~blocks. So we need $2B$ I/O operations to read it to cache, which is still $\O(B)$.
+spans 2~blocks. So we need $2B$ I/O operations to read it into cache, which is still $\O(B)$.
The cache must contain at least $4B^2$ items, but this is still within limits of our tall-cache
assumption.
+To process all $\O(N^2/B^2+1)$ tiles, we need $\O(N^2/B + B)$ I/O operations. As usual, this
+can be improved to $\O(N^2/B + 1)$ if we realize that the additional term is required only
+in cases where the whole matrix is smaller than a~single block.
+
We can conclude that in the cache-aware model, we can transpose a~$N\times N$ matrix
in time $\Theta(N^2)$ with $\O(N^2/B + 1)$ block transfers. This is obviously optimal.
@@ -274,7 +276,7 @@ whole algorithm finishes in $\O(N^2)$ steps.
To analyze I/O complexity, we focus on the highest level, at which the sub-problems correspond
to tiles from the previous algorithm. Specifically, we will find the smallest~$i$ such that
-the sub-problem size $d = N^2/i$ is at most~$B$. Unless the whole input is small and $i=0$,
+the sub-problem size $d = N/2^i$ is at most~$B$. Unless the whole input is small and $i=0$,
this implies $2d = N/2^{i-1} > B$. Therefore $B/2 < d \le B$.
To establish an upper bound on the optimal number of block transfers, we show a~concrete