Batched DGEMMs for scientific codes running on long vector architectures

Fabio Banchelli, Marta Garcia-Gasulla, Filippo Mantovani

TL;DR

Problem: SeisSol performance is constrained by limited vector utilization on long-vector architectures. Approach: a portable batched DGEMM library in plain C is developed and integrated into SeisSol to expose ILP on long vector units. Key contributions: an API design with compile-time sized kernels, validation showing numerical equivalence within $1.0\times 10^{-5}$, cross-architecture portability, and speedups of up to $32.60\times$ over the reference. Significance: demonstrates that portable, architecture-friendly batched kernels can substantially improve the performance of scientific codes on CPU long-vector platforms.

Abstract

In this work, we evaluate the performance of SeisSol, a simulator of seismic wave phenomena and earthquake dynamics, on a RISC-V-based system utilizing a vector processing unit. We focus on GEMM libraries and address their limited ability to leverage long vector architectures by developing a batched DGEMM library in plain C. This library achieves speedups ranging from approximately 3.5x to 32.6x compared to the reference implementation. We then integrate the batched approach into the SeisSol application, ensuring portability across different CPU architectures. Lastly, we demonstrate that our implementation is portable to an Intel CPU, resulting in improved execution times in most cases.
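To make the batched idea concrete, here is a minimal sketch in plain C. It is not the authors' library: the function name `batched_dgemm`, the sizes `BM`, `BN`, `BK`, and the interleaved memory layout are all assumptions chosen for illustration. The point it shows is that, when many small matrices of identical size are multiplied, placing the batch index in the innermost unit-stride loop gives the compiler a long, dependency-free loop to vectorize, regardless of how small each individual matrix is.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical compile-time matrix sizes (illustrative values only). */
#define BM 2 /* rows of A and C */
#define BN 2 /* columns of B and C */
#define BK 2 /* columns of A, rows of B */

/*
 * Assumed layout: element (i, j) of the p-th matrix in a batch is stored at
 * X[(i * cols + j) * batch + p], i.e. the batch index is contiguous in memory.
 * Computes C_p += A_p * B_p for every p in [0, batch).
 */
static void batched_dgemm(size_t batch, const double *A, const double *B,
                          double *C)
{
    assert(A != NULL && B != NULL && C != NULL);

    for (size_t i = 0; i < BM; ++i) {
        for (size_t j = 0; j < BN; ++j) {
            for (size_t k = 0; k < BK; ++k) {
                const double *a = A + (i * BK + k) * batch;
                const double *b = B + (k * BN + j) * batch;
                double *c = C + (i * BN + j) * batch;

                /* Innermost loop over the batch: unit stride and no
                 * loop-carried dependency, so its trip count (not the
                 * matrix size) bounds the usable vector length. */
                for (size_t p = 0; p < batch; ++p) {
                    c[p] += a[p] * b[p];
                }
            }
        }
    }
}
```

Because the sizes are fixed at compile time, the three outer loops can be fully unrolled by the compiler, leaving only the batch loop as vector work. Whether the library described here uses this exact layout or loop order is not stated in the text above.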

Paper Structure (23 sections, 11 figures, 1 table)

Figures (11)

  • Figure 1: Timeline of six timesteps in SeisSol. Yellow regions correspond to computeLocalIntegration, while red regions correspond to computeNeighboringIntegration.
  • Figure 2: Performance (GFlop/s) of SeisSol in fpga-sdv using different GEMM libraries.
  • Figure 3: Vector Length of vfmadd instructions during kernel::derivative::execute1.
  • Figure 4: Number of register spills when increasing the matrix sizes.
  • Figure 5: Instruction timeline of 20_9_10_csi (all).
  • ...and 6 more figures