Blocking for cache

In this part we use a very simple memory model that includes a cache, so we added an additional level of blocking for the cache. With the kernels we developed in part 3, we use 192×192 blocks, as sketched below.
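Here is a minimal sketch of what the blocked loop structure looks like, assuming single precision, column-major storage, and matrix dimensions that are multiples of the block size; the naive inner triple loop is only a placeholder for the register kernel developed in part 3:

```c
#define BLOCK 192  /* block size used above */

/* C += A * B, column-major, with one level of blocking for the cache.
 * M, N, K are assumed to be multiples of BLOCK to keep the sketch short.
 * The inner triple loop stands in for the part 3 register kernel. */
static void gemm_blocked(int M, int N, int K,
                         const float *A, int lda,
                         const float *B, int ldb,
                         float *C, int ldc)
{
    for (int jb = 0; jb < N; jb += BLOCK)
        for (int pb = 0; pb < K; pb += BLOCK)
            for (int ib = 0; ib < M; ib += BLOCK)
                /* Multiply one BLOCK x BLOCK block of A by one block of B;
                 * the blocks are small enough to be reused from cache. */
                for (int j = jb; j < jb + BLOCK; j++)
                    for (int p = pb; p < pb + BLOCK; p++)
                        for (int i = ib; i < ib + BLOCK; i++)
                            C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
}
```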
With this simple technique, our GEMM reaches around 2 GFLOPS. We did not tune the block size for the cache, so there is likely more performance to gain by adjusting it.
However, we are still far behind OpenBLAS.
Blocking for registers

In this part we use a very simple memory model. We assume there are only main memory and registers, and we load blocks of A, B and C into registers as shown below:
As we mentioned before, the Raspberry Pi 3 has NEON support, so we load data into the vector registers. There are four intrinsic functions we need here (see the micro-kernel sketch after this list):
vld1q_f32: loads a vector from memory into a register, much like _mm256_loadu_pd in AVX.
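As a rough illustration, here is a 4×4 micro-kernel that keeps a 4×4 block of C in vector registers. It assumes column-major storage, and besides vld1q_f32 it uses vdupq_n_f32, vmlaq_f32 and vst1q_f32, which is only my guess at the remaining intrinsics; the actual kernel may be organized differently:

```c
#include <arm_neon.h>

/* Sketch: C(4x4) += A(4xK) * B(Kx4), column-major, leading dimensions
 * lda, ldb, ldc.  The 4x4 block of C stays in four NEON registers. */
static void kernel_4x4(int K, const float *A, int lda,
                       const float *B, int ldb,
                       float *C, int ldc)
{
    float32x4_t c0 = vld1q_f32(&C[0 * ldc]);  /* column 0 of C */
    float32x4_t c1 = vld1q_f32(&C[1 * ldc]);
    float32x4_t c2 = vld1q_f32(&C[2 * ldc]);
    float32x4_t c3 = vld1q_f32(&C[3 * ldc]);

    for (int p = 0; p < K; p++) {
        /* One column of A. */
        float32x4_t a = vld1q_f32(&A[p * lda]);

        /* Broadcast one element of B per column of C and accumulate. */
        c0 = vmlaq_f32(c0, a, vdupq_n_f32(B[p + 0 * ldb]));
        c1 = vmlaq_f32(c1, a, vdupq_n_f32(B[p + 1 * ldb]));
        c2 = vmlaq_f32(c2, a, vdupq_n_f32(B[p + 2 * ldb]));
        c3 = vmlaq_f32(c3, a, vdupq_n_f32(B[p + 3 * ldb]));
    }

    vst1q_f32(&C[0 * ldc], c0);
    vst1q_f32(&C[1 * ldc], c1);
    vst1q_f32(&C[2 * ldc], c2);
    vst1q_f32(&C[3 * ldc], c3);
}
```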
Besides the triple nested loop, there are alternative ways to compute GEMM.
GEMM using GEMV

GEMM can be computed using GEMV as shown below: each column of C is A multiplied by the corresponding column of B.
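A naive, unoptimized sketch of this view, assuming column-major storage:

```c
/* y = A * x, where A is M x K (column-major, leading dimension lda). */
static void gemv(int M, int K, const float *A, int lda,
                 const float *x, float *y)
{
    for (int i = 0; i < M; i++) {
        float sum = 0.0f;
        for (int p = 0; p < K; p++)
            sum += A[i + p * lda] * x[p];
        y[i] = sum;
    }
}

/* C = A * B computed as N matrix-vector products. */
static void gemm_by_gemv(int M, int N, int K,
                         const float *A, int lda,
                         const float *B, int ldb,
                         float *C, int ldc)
{
    for (int j = 0; j < N; j++)
        /* Column j of C is A times column j of B. */
        gemv(M, K, A, lda, &B[j * ldb], &C[j * ldc]);
}
```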
GEMM using rank-1 update (GER)

GEMM can also be viewed as a series of rank-1 update (GER) operations: C is the sum over p of the outer product of column p of A with row p of B.
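A naive sketch of that loop ordering, again assuming column-major storage:

```c
/* C += A * B accumulated as K rank-1 updates: C += a_p * b_p^T,
 * where a_p is column p of A and b_p^T is row p of B. */
static void gemm_by_ger(int M, int N, int K,
                        const float *A, int lda,
                        const float *B, int ldb,
                        float *C, int ldc)
{
    for (int p = 0; p < K; p++)          /* one rank-1 update per p */
        for (int j = 0; j < N; j++)      /* element j of row p of B */
            for (int i = 0; i < M; i++)  /* column p of A           */
                C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
}
```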
Performance

If we don’t use GEMV or GER from a good BLAS and just use naive loops like the sketches above, the performance can be even worse than that of our JPI loop.
Last week I got a Raspberry Pi from a friend, and I have been thinking about doing something with this little board.
First, let’s check the CPU spec by running cat /proc/cpuinfo.
```
processor       : 0
model name      : ARMv7 Processor rev 4 (v7l)
BogoMIPS        : 38.40
Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
CPU implementer : 0x41
CPU architecture: 7
CPU variant     : 0x0
CPU part        : 0xd03
CPU revision    : 4
```

We can see the neon feature, which is similar to SSE on x86 CPUs.