SUNDIALS Time Integrators for Exascale Applications with Many Independent ODE Systems

Cody J. Balos; Marc Day; Lucas Esclapez; Anne M. Felden; David J. Gardner; Malik Hassanaly; Daniel R. Reynolds; Jon Rood; Jean M. Sexton; Nicholas T. Wimer; Carol S. Woodward

SUNDIALS Time Integrators for Exascale Applications with Many Independent ODE Systems

Cody J. Balos, Marc Day, Lucas Esclapez, Anne M. Felden, David J. Gardner, Malik Hassanaly, Daniel R. Reynolds, Jon Rood, Jean M. Sexton, Nicholas T. Wimer, Carol S. Woodward

TL;DR

This work demonstrates how batched, GPU-optimized time integration within SUNDIALS can efficiently solve the many independent ODEs arising from operator-split PDEs in exascale combustion and cosmology codes (Pele and Nyx). By exploiting CVODE/ARKODE in a batched, per-cell ODE framework and aligning data layout, tolerances, and linear solvers to GPU architectures, the authors achieve substantial performance gains and scalability on Frontier, Summit, and Perlmutter. Key contributions include dynamic tolerance strategies based on typical state scales, batched and memory-pool–driven memory management, and domain-specific optimizations such as data ordering, kernel fusion, tiling, and batched linear solves. The results indicate the approach is ready for exascale workloads, enabling efficient, scalable simulations of complex multiphysics systems with thousands to millions of independent ODEs across entire computational domains.

Abstract

Many complex systems can be accurately modeled as a set of coupled time-dependent partial differential equations (PDEs). However, solving such equations can be prohibitively expensive, easily taxing the world's largest supercomputers. One pragmatic strategy for attacking such problems is to split the PDEs into components that can more easily be solved in isolation. This operator splitting approach is used ubiquitously across scientific domains, and in many cases leads to a set of ordinary differential equations (ODEs) that need to be solved as part of a larger "outer-loop" time-stepping approach. The SUNDIALS library provides a plethora of robust time integration algorithms for solving ODEs, and the U.S. Department of Energy Exascale Computing Project (ECP) has supported its extension to applications on exascale-capable computing hardware. In this paper, we highlight some SUNDIALS capabilities and its deployment in combustion and cosmology application codes (Pele and Nyx, respectively) where operator splitting gives rise to numerous, small ODE systems that must be solved concurrently.

SUNDIALS Time Integrators for Exascale Applications with Many Independent ODE Systems

TL;DR

Abstract

Paper Structure (25 sections, 10 equations, 16 figures, 1 table)

This paper contains 25 sections, 10 equations, 16 figures, 1 table.

Introduction
Overview of SUNDIALS Strategy for Independent ODE Systems on GPUs
Analyzing the Applications
Pele
Nyx
Common Themes and Differences
Improving Performance
Improving performance in Pele
Load balancing
Data Ordering
Selecting Integrator Tolerances
Solver Robustness
Explicit Methods
Batched Linear Solvers
Improving performance in Nyx
...and 10 more sections

Figures (16)

Figure 1: a) CY-ordering for the state vector, where all the state components of a given cell are adjacent in the state vector, along with a schematic of the Jacobian pattern. b) YC-ordering, where adjacent entries contain a given state component for all the cells followed by another component for all the cells, along with the Jacobian pattern.
Figure 2: Dynamically setting the CVODE absolute tolerances by sampling typical values in Pele results in better accuracy and is faster than using a fixed value for the tolerance in nearly every scenario tested. The contour plots (a) and (b) show the average mean-square error of the zero-dimensional reactor temperature profile against $\eta$ (see Equation \ref{['eq:eta']}) and the fluid timestep, $dt_{\textit{CFD}}$, for solution approaches 1A and 1B from Table \ref{['tab:sol_app']}. The contour plots (c) and (d) show the execution time. Gray indicates that the run timed out, or that the error was so large that there was no ignition of the reactants. The typical $dt_{\textit{CFD}}$ for PeleC (red dashed line) and PeleLMeX (blue dashed line) are shown for reference.
Figure 3: PeleC timings with CVODE on a pre-mixed flame test problem on AMD MI100 GPUs using three of the solution approaches from Table \ref{['tab:sol_app']}. Approach 1A uses GMRES with fixed absolute tolerances, approach 1B also uses GMRES but additionally uses typical value absolute tolerances, and 2B uses a direct solver (MAGMA) and typical value absolute tolerance. All three approaches use shared memory reductions. Using typical value absolute tolerances improves the performance by 25% while the direct solver gives the fastest results.
Figure 4: Using a direct linear solver with either a numerically computed or an analytically computed Jacobian is more robust than a Newton-Krylov (NK) approach for $dt_{CFD}$ in the PeleLMeX regime (near the blue dashed line). I.e., it allows the integrator to meet the error requirements with a timestep that is larger and therefore the overall integration is faster than with NK and GMRES. In the PeleC regime (near the red dashed line) NK with GMRES is effective due to the small time scales. The top row shows the averaged mean squared error for the 0D reactor test problem when using solution approaches 1B, 2B, and 3B from Table \ref{['tab:sol_app']} respectively. The bottom row shows the execution time. In this case, gray indicates the run timed out, or that the error was so large that the reactants did not ignite.
Figure 5: Comparison timings for an explicit method from ARKODE and the implicit methods from CVODE on a pre-mixed flame test case in PeleC. While the explicit method is less expensive per step, the greater stability of the implicit method enables it to take far fewer steps leading to a significantly faster runtime.
...and 11 more figures

SUNDIALS Time Integrators for Exascale Applications with Many Independent ODE Systems

TL;DR

Abstract

SUNDIALS Time Integrators for Exascale Applications with Many Independent ODE Systems

Authors

TL;DR

Abstract

Table of Contents

Figures (16)