SARIS: Accelerating Stencil Computations on Energy-Efficient RISC-V Compute Clusters with Indirect Stream Registers

Paul Scheffler; Luca Colagrande; Luca Benini

TL;DR

Stencil codes suffer from memory-access and address-calculation overheads on energy-efficient processors. The paper introduces SARIS, a general methodology that maps grid data accesses onto register-mapped indirect streams, decoupling memory movement from computation and enabling higher FPU utilization on RISC-V clusters. Evaluations on the open-source eight-core Snitch cluster show an average speedup of 2.72x, an average FPU utilization of 81%, and an average energy-efficiency improvement of 1.58x over an RV32G baseline. Scale-out analysis on a 256-core manycore system estimates an average speedup of 2.14x and an average FPU utilization of 64%, so the benefits persist despite memory-system bandwidth constraints. The work contributes a flexible methodology and open-source baseline and SARIS-accelerated implementations, and it reports up to 15% higher fractions of peak compute than a leading GPU code generator.

Abstract

Stencil codes are performance-critical in many compute-intensive applications, but suffer from significant address calculation and irregular memory access overheads. This work presents SARIS, a general and highly flexible methodology for stencil acceleration using register-mapped indirect streams. We demonstrate SARIS for various stencil codes on an eight-core RISC-V compute cluster with indirect stream registers, achieving significant speedups of 2.72x, near-ideal FPU utilizations of 81%, and energy efficiency improvements of 1.58x over an RV32G baseline on average. Scaling out to a 256-core manycore system, we estimate an average FPU utilization of 64%, an average speedup of 2.14x, and up to 15% higher fractions of peak compute than a leading GPU code generator.
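To make the idea of indirect streams more concrete, the following Python sketch contrasts a loop that computes every neighbour address inline with one that reads operands from a pre-built index stream. It is only a software analogy under stated assumptions, not the paper's implementation: the function names (`star7_offsets`, `stencil_direct`, `operand_stream`, `stencil_indirect`), the flat row-major grid layout, and the coefficient ordering are all hypothetical choices made for illustration.

```python
# Illustrative software model only; all names and the data layout are assumptions.

def star7_offsets(ny, nx):
    """Flat-index offsets of a 7-point star: centre, x-1, x+1, y-1, y+1, z-1, z+1."""
    plane = ny * nx
    return [0, -1, 1, -nx, nx, -plane, plane]


def interior_indices(nz, ny, nx):
    """Flat indices of all interior points of a row-major nz*ny*nx grid."""
    return [(z * ny + y) * nx + x
            for z in range(1, nz - 1)
            for y in range(1, ny - 1)
            for x in range(1, nx - 1)]


def stencil_direct(grid, nz, ny, nx, coeffs):
    """Baseline style: address arithmetic is interleaved with the arithmetic."""
    out = list(grid)
    plane = ny * nx
    for i in interior_indices(nz, ny, nx):
        out[i] = (coeffs[0] * grid[i]
                  + coeffs[1] * grid[i - 1] + coeffs[2] * grid[i + 1]
                  + coeffs[3] * grid[i - nx] + coeffs[4] * grid[i + nx]
                  + coeffs[5] * grid[i - plane] + coeffs[6] * grid[i + plane])
    return out


def operand_stream(grid, bases, offsets):
    """Stands in for an indirect stream: yields operands in consumption order."""
    for base in bases:
        for off in offsets:
            yield grid[base + off]


def stencil_indirect(grid, nz, ny, nx, coeffs):
    """Stream style: the compute loop only pops operands and accumulates."""
    bases = interior_indices(nz, ny, nx)
    stream = operand_stream(grid, bases, star7_offsets(ny, nx))
    out = list(grid)
    for base in bases:
        acc = 0.0
        for c in coeffs:
            acc += c * next(stream)  # no address computation in the inner loop
        out[base] = acc
    return out
```

In this model the generator plays the role that hardware address generators would play: the inner loop of `stencil_indirect` contains only multiply-accumulate work, while the offset table is built once per grid shape. Boundary points are left unchanged in both variants.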

Paper Structure (11 sections, 9 figures, 2 tables)

This paper contains 11 sections, 9 figures, 2 tables.

Introduction
Implementation
SARIS Method
Complementary Optimizations
Stencils on a Snitch Cluster
Evaluation
Performance
Energy and Power
Manycore Scaleout
Related Work
Conclusion

Figures (9)

Figure 1: Integration of three floating-point stream registers (SRs). When configured, the register decode block maps accesses to registers associated with streams to SRs, which act as FIFO interfaces to memory. Addresses are produced by hardware generators.
Figure 2: Visualization and schedule for the symmetric 7-point star stencil code.
Figure 3: Baseline time iteration pseudocode.
Figure 4: Baseline point loop RISC-V assembly.
Figure 5: SARIS time iteration pseudocode.
...and 4 more figures
