Automated MPI-X code generation for scalable finite-difference solvers
George Bisbas, Rhodri Nelson, Mathias Louboutin, Fabio Luporini, Paul H. J. Kelly, Gerard Gorman
TL;DR
This work addresses scalable PDE solving with explicit finite-difference stencils by introducing automated MPI-based code generation integrated into the Devito DSL. The authors implement an end-to-end workflow that starts from high-level symbolic PDEs and yields HPC-ready distributed-memory parallel (DMP) code for CPU and GPU clusters, without requiring users to change their source code. Core contributions include a novel distributed-memory pipeline with domain decomposition, data-region reasoning, halo-exchange optimization, and three computation/communication patterns (basic, diagonal, full), plus support for sparse data. They validate the approach on four geophysics-relevant wave kernels, demonstrating competitive strong and weak scaling and substantial throughput gains over conventional baselines. The framework is open-source and designed to generalize to other PDEs and DSLs, reducing development effort for large-scale scientific simulations.
Abstract
Partial differential equations (PDEs) are crucial in modeling diverse phenomena across scientific disciplines, including seismic and medical imaging, computational fluid dynamics, image processing, and neural networks. Solving these PDEs at scale is an intricate and time-intensive process that demands careful tuning. This paper introduces automated code-generation techniques specifically tailored for distributed memory parallelism (DMP) to execute explicit finite-difference (FD) stencils at scale, a fundamental challenge in numerous scientific applications. These techniques are implemented and integrated into the Devito DSL and compiler framework, a well-established solution for automating the generation of FD solvers based on a high-level symbolic math input. Users benefit from modeling simulations for real-world applications at a high-level symbolic abstraction and effortlessly harnessing HPC-ready distributed-memory parallelism without altering their source code. This results in drastic reductions both in execution time and developer effort. A comprehensive performance evaluation of Devito's DMP via MPI demonstrates highly competitive strong and weak scaling on CPU and GPU clusters, proving its effectiveness and capability to meet the demands of large-scale scientific simulations.
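To make the terms domain decomposition and halo exchange concrete, the following is a minimal single-process sketch in plain Python. It is not Devito's generated code and does not use MPI: the "ranks" are simply list blocks, and every function name here (`decompose`, `halo_exchange`, `step`, `run_distributed`, `run_serial`) is hypothetical, chosen for illustration only. The stencil is assumed to be a 1D explicit diffusion update with a one-point halo and zero-valued boundaries. The sketch shows that exchanging one ghost value per side before each update reproduces the result of the undecomposed computation.

```python
def decompose(n, nranks):
    """Split n grid points into contiguous blocks of near-equal size."""
    base, rem = divmod(n, nranks)
    bounds, start = [], 0
    for r in range(nranks):
        size = base + (1 if r < rem else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def halo_exchange(blocks, boundary=0.0):
    """Pad each block with one ghost value per side, taken from its
    neighbours (or the boundary value at the ends of the domain).
    In a real MPI code this would be a pair of send/receive calls."""
    padded = []
    for r, block in enumerate(blocks):
        left = blocks[r - 1][-1] if r > 0 else boundary
        right = blocks[r + 1][0] if r < len(blocks) - 1 else boundary
        padded.append([left] + block + [right])
    return padded


def step(u, alpha):
    """One explicit update of the interior points of a padded block."""
    return [u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
            for i in range(1, len(u) - 1)]


def run_distributed(u0, nranks, nsteps, alpha=0.25):
    """Exchange halos, then compute, once per time step.
    Assumes len(u0) >= nranks so that no block is empty."""
    blocks = [u0[a:b] for a, b in decompose(len(u0), nranks)]
    for _ in range(nsteps):
        blocks = [step(p, alpha) for p in halo_exchange(blocks)]
    return [v for block in blocks for v in block]


def run_serial(u0, nsteps, alpha=0.25):
    """Reference computation on the undecomposed domain."""
    u = list(u0)
    for _ in range(nsteps):
        u = step([0.0] + u + [0.0], alpha)
    return u
```

For example, with `u0 = [0.0] * 16` and `u0[8] = 1.0`, `run_distributed(u0, 4, 10)` and `run_serial(u0, 10)` return the same list, because each point is updated from the same three values in both cases. Wider stencils need wider halos, and a generated MPI code would additionally have to handle multi-dimensional neighbours and the overlap of communication with computation, which this sketch leaves out.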
