Batched DGEMMs for scientific codes running on long vector architectures
Fabio Banchelli, Marta Garcia-Gasulla, Filippo Mantovani
TL;DR
Problem: the GEMM libraries used by SeisSol make limited use of long vector units, which constrains performance on long-vector architectures. Approach: a portable batched DGEMM library is written in plain C and integrated into SeisSol to expose enough parallelism to fill long vector units. Key contributions: an API design with compile-time sized kernels, validation showing numerical equivalence within $1.0\times 10^{-5}$, portability across CPU architectures, and speedups of up to $32.60\times$ over the reference implementation. Significance: portable, architecture-friendly batched kernels can substantially improve the performance of scientific codes on CPU long-vector platforms.
Abstract
In this work, we evaluate the performance of SeisSol, a simulator of seismic wave phenomena and earthquake dynamics, on a RISC-V-based system utilizing a vector processing unit. We focus on GEMM libraries and address their limited ability to leverage long vector architectures by developing a batched DGEMM library in plain C. This library achieves speedups ranging from approximately $3.5\times$ to $32.6\times$ compared to the reference implementation. We then integrate the batched approach into the SeisSol application, ensuring portability across different CPU architectures. Lastly, we demonstrate that our implementation is portable to an Intel CPU, resulting in improved execution times in most cases.
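To illustrate why batching helps on long vector units, the following is a minimal sketch in plain C, not the library's actual code. It assumes an interleaved storage layout in which the same matrix element of every batch entry is contiguous in memory, so the innermost loop runs over the batch with unit stride and can be auto-vectorized at whatever vector length the hardware offers. The function name `batched_dgemm_acc`, the macros `BD_M`, `BD_N`, `BD_K`, and the layout itself are hypothetical choices made for this example.

```c
#include <stddef.h>

/* Hypothetical compile-time matrix sizes (override with -DBD_M=... etc.). */
#ifndef BD_M
#define BD_M 4
#endif
#ifndef BD_N
#define BD_N 4
#endif
#ifndef BD_K
#define BD_K 4
#endif

/*
 * Computes C_s += A_s * B_s for every batch entry s in [0, batch).
 *
 * Assumed interleaved layout (an illustration, not the paper's format):
 *   element (i, k) of A_s is stored at A[(i * BD_K + k) * batch + s]
 *   element (k, j) of B_s is stored at B[(k * BD_N + j) * batch + s]
 *   element (i, j) of C_s is stored at C[(i * BD_N + j) * batch + s]
 *
 * The small loops over i, j, k have compile-time bounds; the innermost
 * loop over s is long, unit-stride, and free of loop-carried dependencies,
 * which is the shape a vectorizing compiler needs.
 */
void batched_dgemm_acc(size_t batch,
                       const double *restrict A,
                       const double *restrict B,
                       double *restrict C)
{
    for (size_t i = 0; i < BD_M; ++i) {
        for (size_t j = 0; j < BD_N; ++j) {
            double *c = &C[(i * BD_N + j) * batch];
            for (size_t k = 0; k < BD_K; ++k) {
                const double *a = &A[(i * BD_K + k) * batch];
                const double *b = &B[(k * BD_N + j) * batch];
                /* Vectorizable loop across the batch dimension. */
                for (size_t s = 0; s < batch; ++s) {
                    c[s] += a[s] * b[s];
                }
            }
        }
    }
}
```

With this arrangement the usable vector length depends on the batch size rather than on the (small) matrix dimensions, which is one plausible reason a batched formulation suits long vector architectures better than calling a small-matrix DGEMM once per element.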
