GEMM on Raspberry Pi (4)
Blocking for cache
In this part, we are using a super simple memory model with cache, so we added an additional blocking for cache. With the kernels we developed in part3, we also used 192*192 blocks.
With this simple technique, we can have a gemm with around 2 GFLOPS. We didn’t optimize the cache size, which means we can potentially improve more by changing the block size.
However, we are still far behind OpenBLAS.