A Portable Framework for Accelerating Stencil Computations on Modern Node Architectures

Ryuichi Sai; John Mellor-Crummey; Jinfan Xu; Mauricio Araya-Polo

TL;DR

StencilPy is a portable, Python-embedded DSL framework for accelerating high-order stencil computations on modern CPUs, GPUs, and accelerators. It combines a multi-layer IR stack with backend-specific code generators (seq, omp, cuda, hip, sycl, stx, CSL) and a JIT launcher to deliver performance close to hand-crafted code while maintaining portability. The approach is validated on a $25$-point star-shaped stencil for acoustic isotropic wave modeling, showing strong numerical accuracy (max error $\approx 10^{-7}$; RMSD $\approx 10^{-8}$) and competitive runtime performance across platforms. It also improves productivity: the StencilPy kernel is about one-quarter the length of hand-crafted CUDA code and 7x shorter than hand-optimized CSL code. Future plans include auto-tuning, frontend extensions, and broader backend support to further reduce development costs for complex stencil-based simulations.

Abstract

Finite-difference methods based on high-order stencils are widely used in seismic simulations, weather forecasting, computational fluid dynamics, and other scientific applications. Achieving HPC-level stencil performance on one architecture is challenging, and porting to other architectures without sacrificing performance requires significant effort, especially in this golden age of many distinctive architectures. To help developers achieve performance, portability, and productivity with stencil computations, we developed StencilPy. With StencilPy, developers write stencil computations in a high-level domain-specific language, which promotes productivity, while its backends generate efficient code for existing and emerging architectures, including modern many-core CPUs (such as AMD Genoa-X, Fujitsu A64FX, and Intel Sapphire Rapids), the latest generations of GPUs (including NVIDIA H100 and A100, AMD MI200, and Intel Ponte Vecchio), and accelerators (including Cerebras and STX). StencilPy demonstrates promising performance results on par with hand-written code, maintains cross-architectural performance portability, and enhances productivity. Its modular design enables easy configuration, customization, and extension. A 25-point star-shaped stencil written in StencilPy is one-quarter of the length of a hand-crafted CUDA implementation and achieves similar performance on an NVIDIA H100 GPU. In addition, the same kernel written using our tool is 7x shorter than hand-optimized code written in Cerebras Software Language (CSL), and it delivers performance comparable to that code on a Cerebras CS-2.
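For readers unfamiliar with the term, the sketch below shows what a 25-point star-shaped stencil computes, written in plain NumPy. It is not StencilPy code and does not use the StencilPy DSL or any of its backends. It assumes the stencil is the standard 8th-order central-difference Laplacian (radius 4 along each of three axes, so 1 + 3 × 2 × 4 = 25 points); the names `laplacian_25pt` and `count_stencil_points` are hypothetical and chosen only for this illustration.

```python
import numpy as np

# Standard 8th-order central-difference weights for a second derivative.
# COEFFS[0] is the centre weight; COEFFS[k] applies to the points at offset +/-k.
COEFFS = np.array([-205.0 / 72.0, 8.0 / 5.0, -1.0 / 5.0, 8.0 / 315.0, -1.0 / 560.0])
RADIUS = 4  # assumption: radius 4 per axis gives the 25-point star in 3D


def count_stencil_points(radius=RADIUS, ndim=3):
    """Number of points in a star-shaped stencil: centre plus 2*radius per axis."""
    return 1 + 2 * radius * ndim


def laplacian_25pt(u, h=1.0):
    """Apply the star-shaped stencil to the interior of a 3D array.

    Cells within RADIUS of any boundary are left at zero (no halo handling).
    """
    r = RADIUS
    nx, ny, nz = u.shape
    out = np.zeros_like(u)
    # The centre point contributes once per axis.
    acc = 3.0 * COEFFS[0] * u[r:nx - r, r:ny - r, r:nz - r]
    for k in range(1, r + 1):
        acc = acc + COEFFS[k] * (
            u[r - k:nx - r - k, r:ny - r, r:nz - r]    # x - k
            + u[r + k:nx - r + k, r:ny - r, r:nz - r]  # x + k
            + u[r:nx - r, r - k:ny - r - k, r:nz - r]  # y - k
            + u[r:nx - r, r + k:ny - r + k, r:nz - r]  # y + k
            + u[r:nx - r, r:ny - r, r - k:nz - r - k]  # z - k
            + u[r:nx - r, r:ny - r, r + k:nz - r + k]  # z + k
        )
    out[r:nx - r, r:ny - r, r:nz - r] = acc / (h * h)
    return out
```

A quick sanity check is to apply the function to u = x² + y² + z² on a small grid: the interior values should equal the analytic Laplacian, 6, up to floating-point rounding, because central differences of this order are exact for low-degree polynomials.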

Paper Structure (67 sections, 3 equations, 9 figures, 23 tables, 4 algorithms)

This paper contains 67 sections, 3 equations, 9 figures, 23 tables, 4 algorithms.

Introduction
Background
High-Order Stencil Computations
Seismic Modeling and Acoustic Isotropic Approximation
Related Work
Stencil Optimizations
Stencil DSLs and automated code generations
Software Frameworks in Python
Framework Design and Architecture
DSL and Frontend Design
StencilPy-Specific Constructs
Type Hints Required
Framework Implementation and Optimization
Stencil Performance Optimizations
Workflow
...and 52 more sections

Figures (9)

Figure 1: A star-shaped 25-point stencil.
Figure 2: The StencilPy framework architecture.
Figure 3: IR hierarchy in StencilPy framework.
Figure 4: 3D grid of size ${N_x \times N_y \times N_z}$. X and Y dimensions are mapped onto the PE grid of the WSE, while Z dimension is mapped onto memory of each PE.
Figure 5: Stencil point index pattern used in DFIR and CSL code generation.
...and 4 more figures
