SARIS: Accelerating Stencil Computations on Energy-Efficient RISC-V Compute Clusters with Indirect Stream Registers
Paul Scheffler, Luca Colagrande, Luca Benini
TL;DR
Stencil codes suffer from memory-access and address-calculation overheads on energy-efficient processors. The paper introduces SARIS, a general methodology that maps grid data accesses onto register-mapped indirect streams, decoupling memory movement from computation and raising FPU utilization on RISC-V clusters. On the open-source eight-core Snitch cluster, SARIS achieves an average speedup of 2.72x, an FPU utilization of 81%, and an energy-efficiency improvement of 1.58x over an RV32G baseline. A scale-out analysis on a 256-core manycore estimates an average speedup of 2.14x and an FPU utilization of 64% despite memory-system bandwidth constraints, with up to 15% higher fractions of peak compute than a leading GPU code generator. The work contributes a flexible methodology together with open-source baseline and SARIS-accelerated implementations.
Abstract
Stencil codes are performance-critical in many compute-intensive applications, but suffer from significant address calculation and irregular memory access overheads. This work presents SARIS, a general and highly flexible methodology for stencil acceleration using register-mapped indirect streams. We demonstrate SARIS for various stencil codes on an eight-core RISC-V compute cluster with indirect stream registers, achieving on average a speedup of 2.72x, a near-ideal FPU utilization of 81%, and an energy efficiency improvement of 1.58x over an RV32G baseline. Scaling out to a 256-core manycore system, we estimate an average FPU utilization of 64%, an average speedup of 2.14x, and up to 15% higher fractions of peak compute than a leading GPU code generator.
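
To make the idea of an indirect stream concrete, the following is a minimal software sketch, not the paper's implementation. It assumes a 5-point star stencil on a row-major grid. The names make_offsets, indirect_stream, stencil_streamed, and stencil_reference are hypothetical and chosen only for illustration. The generator stands in for a stream register: it walks a precomputed index array and delivers grid values, so the compute loop contains no address arithmetic. In hardware, that operand delivery would proceed alongside the floating-point work rather than inside it.

```python
# Software model of an indirect stream feeding a stencil kernel.
# Assumption: this only illustrates the access pattern; the real mechanism
# is a hardware stream register, not a Python generator.

def make_offsets(width):
    """Index offsets of a 5-point star stencil in a row-major grid."""
    return [0, -1, 1, -width, width]

def indirect_stream(grid, base, offsets):
    """Yield grid[base + off] for each offset (models the indirect stream)."""
    for off in offsets:
        yield grid[base + off]

def stencil_streamed(grid, width, height, coeffs):
    """Compute loop that only consumes streamed operands."""
    offsets = make_offsets(width)
    out = list(grid)  # boundary points are copied unchanged
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            base = y * width + x
            acc = 0.0
            for c, v in zip(coeffs, indirect_stream(grid, base, offsets)):
                acc += c * v  # pure multiply-accumulate, no index math here
            out[base] = acc
    return out

def stencil_reference(grid, width, height, coeffs):
    """Conventional version with explicit address calculation per operand."""
    out = list(grid)
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            i = y * width + x
            out[i] = (coeffs[0] * grid[i]
                      + coeffs[1] * grid[i - 1]
                      + coeffs[2] * grid[i + 1]
                      + coeffs[3] * grid[i - width]
                      + coeffs[4] * grid[i + width])
    return out

if __name__ == "__main__":
    width, height = 4, 3
    grid = [float(i) for i in range(width * height)]
    coeffs = [0.5, 0.125, 0.125, 0.125, 0.125]
    print(stencil_streamed(grid, width, height, coeffs))
```

Both functions produce the same result. The difference is where the index arithmetic lives: in the reference version it is interleaved with the arithmetic, while in the streamed version it is confined to the index array and the stream model.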
