Batched DGEMMs for scientific codes running on long vector architectures

Fabio Banchelli; Marta Garcia-Gasulla; Filippo Mantovani

TL;DR

Problem: SeisSol performance is constrained by limited vector utilization on long-vector architectures. Approach: a portable batched DGEMM library in plain C is developed and integrated into SeisSol to expose ILP on long vector units. Key contributions: API design with compile-time sized kernels, validation showing numerical equivalence within $1.0\times 10^{-5}$, cross-architecture portability, and substantial speedups of up to $32.60\times$ over the reference. Significance: demonstrates that portable, architecture-friendly batched kernels can dramatically improve the performance of scientific codes on CPU long-vector platforms.

Abstract

In this work, we evaluate the performance of SeisSol, a simulator of seismic wave phenomena and earthquake dynamics, on a RISC-V-based system utilizing a vector processing unit. We focus on GEMM libraries and address their limited ability to leverage long vector architectures by developing a batched DGEMM library in plain C. This library achieves speedups ranging from approximately 3.5x to 32.6x compared to the reference implementation. We then integrate the batched approach into the SeisSol application, ensuring portability across different CPU architectures. Lastly, we demonstrate that our implementation is portable to an Intel CPU, resulting in improved execution times in most cases.
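The batched idea mentioned in the abstract can be illustrated with a minimal sketch in plain C. It is not the authors' library: the function name `batched_dgemm_sketch`, the helper `idx`, and the interleaved storage layout are all assumptions made for illustration. The point it shows is that when many small, equally sized matrix products are stored so that the batch index is unit-stride, the innermost loop runs over the batch and can be as long as the batch itself, which is the kind of loop a compiler can map onto long vector registers.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed (hypothetical) layout: element (i, j) of matrix number b is stored
 * at base[(i * cols + j) * batch + b], so the batch index is unit-stride. */
static size_t idx(size_t i, size_t j, size_t cols, size_t batch, size_t b)
{
    return (i * cols + j) * batch + b;
}

/* Computes C[b] += A[b] * B[b] for every b in [0, batch).
 * Each A[b] is m x k, each B[b] is k x n, each C[b] is m x n. */
void batched_dgemm_sketch(size_t m, size_t n, size_t k, size_t batch,
                          const double *A, const double *B, double *C)
{
    assert(batch > 0);
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            for (size_t p = 0; p < k; ++p) {
                const double *a = &A[idx(i, p, k, batch, 0)];
                const double *bm = &B[idx(p, j, n, batch, 0)];
                double *c = &C[idx(i, j, n, batch, 0)];
                /* Innermost loop over the batch: contiguous loads and stores,
                 * with a trip count equal to the batch size. */
                for (size_t b = 0; b < batch; ++b) {
                    c[b] += a[b] * bm[b];
                }
            }
        }
    }
}
```

With this layout the vector length is bounded by the batch size rather than by the (small) matrix dimensions. A real implementation would also need to decide how to fix the matrix sizes and how to handle register pressure, which this sketch does not address.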

Paper Structure (23 sections, 11 figures, 1 table)

Introduction and related work
Background and methodology
Hardware platform
Software environment
Tracing and performance evaluation
Build configuration
SeisSol
Execution structure
DGEMM-based kernels
General structure
Performance out-of-the-box
Batched DGEMMs
Standards and problem constraints
Implementation
Register spilling
...and 8 more sections

Figures (11)

Figure 1: Timeline of six timesteps in SeisSol. Yellow regions correspond to computeLocalIntegration, while red regions correspond to computeNeighboringIntegration.
Figure 2: Performance (GFlop/s) of SeisSol in fpga-sdv using different GEMM libraries.
Figure 3: Vector Length of vfmadd instructions during kernel::derivative::execute1.
Figure 4: Number of register spills when increasing the matrix sizes.
Figure 5: Instruction timeline of 20_9_10_csi (all).
...and 6 more figures
