Automated MPI-X code generation for scalable finite-difference solvers

George Bisbas; Rhodri Nelson; Mathias Louboutin; Fabio Luporini; Paul H. J. Kelly; Gerard Gorman

TL;DR

This work addresses scalable PDE solving with explicit finite-difference stencils by introducing automated MPI code generation integrated into the Devito DSL. The authors implement an end-to-end workflow that starts from high-level symbolic PDEs and yields HPC-ready distributed-memory parallel (DMP) code for CPU and GPU clusters without requiring any changes to user code. Core contributions include a distributed-memory pipeline with domain decomposition, data-region reasoning, halo-exchange detection and optimization, and three computation/communication patterns (basic, diagonal, full), plus support for sparse data. They validate the approach on four geophysics-relevant wave kernels, demonstrating competitive strong and weak scaling and substantial throughput gains over conventional baselines. The framework is open-source and designed to generalize to other PDEs and DSLs, reducing development effort for large-scale scientific simulations.

Abstract

Partial differential equations (PDEs) are crucial in modeling diverse phenomena across scientific disciplines, including seismic and medical imaging, computational fluid dynamics, image processing, and neural networks. Solving these PDEs at scale is an intricate and time-intensive process that demands careful tuning. This paper introduces automated code-generation techniques specifically tailored for distributed memory parallelism (DMP) to execute explicit finite-difference (FD) stencils at scale, a fundamental challenge in numerous scientific applications. These techniques are implemented and integrated into the Devito DSL and compiler framework, a well-established solution for automating the generation of FD solvers based on a high-level symbolic math input. Users benefit from modeling simulations for real-world applications at a high-level symbolic abstraction and effortlessly harnessing HPC-ready distributed-memory parallelism without altering their source code. This results in drastic reductions both in execution time and developer effort. A comprehensive performance evaluation of Devito's DMP via MPI demonstrates highly competitive strong and weak scaling on CPU and GPU clusters, proving its effectiveness and capability to meet the demands of large-scale scientific simulations.

Paper Structure (33 sections, 4 equations, 12 figures, 2 tables)


Introduction
The Devito DSL and compiler framework
Automated Distributed-memory Parallelism
Domain decomposition
Data access (read/write)
Sparse data
Access alignment
Data regions
Detecting halo exchanges
Building and optimizing halo exchanges
Computation/Communication patterns
Performance evaluation
Hardware
CPU cluster
GPU cluster
...and 18 more sections

Figures (12)

Figure 1: A high-level overview of the Devito compilation framework: from high-level symbolic maths to HPC-ready optimized code.
Figure 2: Users can tailor the domain decomposition using various configurations, such as (4,2,2), (2,2,4), and (4,4,1), each suitable for 16 MPI ranks.
Figure 3: The compiler analyzes data dependencies to schedule the ownership of non-aligned sparse points. Points at shared boundaries are scheduled to the respective involved ranks.
Figure 4: Aliases for data regions facilitate reasoning about and modeling halo exchanges.
Figure 5: Different colors indicate data owned and exchanged by different ranks. Matching colors on different ranks show the data updated from neighbors. Basic mode performs exchanges in a multi-step synchronous manner. Diagonal mode performs additional diagonal communications. Full mode overlaps communication with computation: the domain is split into CORE and REMAINDER (R) areas, and the data needed by the REMAINDER areas is communicated asynchronously while the CORE is computed.
...and 7 more figures
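
The decomposition choices named in the Figure 2 caption can be illustrated with a small standalone sketch. It is not Devito's implementation: the helper names `process_grids` and `local_shape` are hypothetical, and the block-splitting rule (the first ranks along an axis absorb any remainder points) is an assumption made for illustration only. The sketch enumerates every 3D process grid whose rank count multiplies to 16 and computes the local block shape a given rank would own under that rule.

```python
from itertools import product
from math import prod


def process_grids(nranks, ndim=3):
    """Return all ordered ndim-tuples of positive integers whose product is nranks."""
    divisors = [d for d in range(1, nranks + 1) if nranks % d == 0]
    return [t for t in product(divisors, repeat=ndim) if prod(t) == nranks]


def local_shape(global_shape, topology, coords):
    """Shape of the block owned by the rank at `coords` in the process grid.

    Assumed rule (illustrative only): along each axis, the n points are split
    into p blocks, and the first (n % p) ranks receive one extra point.
    """
    shape = []
    for n, p, c in zip(global_shape, topology, coords):
        base, extra = divmod(n, p)
        shape.append(base + (1 if c < extra else 0))
    return tuple(shape)


if __name__ == "__main__":
    grids = process_grids(16)
    # The three configurations named in the Figure 2 caption are all valid.
    for topology in [(4, 2, 2), (2, 2, 4), (4, 4, 1)]:
        assert topology in grids
        print(topology, local_shape((512, 512, 512), topology, (0, 0, 0)))
```

Under these assumptions, a 512x512x512 grid split with topology (4,2,2) gives each rank a 128x256x256 block, while (4,4,1) leaves the third axis undivided. The topology therefore changes the shape of each rank's block and the faces it shares with neighbors, which is why being able to tailor it matters.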
