Performance of a high-order MPI-Kokkos accelerated fluid solver

Filipp Sporykhin, Holger Homann

TL;DR

The paper addresses the challenge of running high-fidelity fluid simulations efficiently on modern heterogeneous HPC architectures. It introduces a single-source, high-order nodal discontinuous Galerkin (NDG) solver implemented with Kokkos and MPI, tested with spatial orders up to eight and Runge-Kutta time integrators up to sixth order for the 1D–3D linear advection and isothermal Euler equations. The results show that a higher spatial order drastically reduces the number of degrees of freedom and the computing time required to reach a given accuracy. GPUs deliver substantial performance gains for large problems, the code shows very good weak scaling, and energy efficiency depends on grid size and hardware generation. The work demonstrates the portability and scalability of a single code base across vendors and highlights environmental considerations, including a rebound effect: newer GPUs need larger grids to be used efficiently, and the larger simulations consume more energy than the efficiency gain of the newer hardware saves.

Abstract

This work discusses the performance of a modern numerical scheme for fluid dynamical problems on modern high-performance computing architectures. Our code implements a spatial nodal discontinuous Galerkin scheme that we test up to an order of convergence of eight. It is temporally coupled to a set of Runge-Kutta (RK) methods of orders up to six. The code integrates the linear advection equations as well as the isothermal Euler equations in one, two, and three dimensions. In order to target modern hardware involving many-core Central Processing Units (CPUs) and accelerators such as Graphics Processing Units (GPUs), we use the Kokkos library in conjunction with the Message Passing Interface (MPI) to run our single-source code on various GPU systems. We find that the higher the order, the faster the code. Eighth-order simulations attain a given global error with much less computing time than third- or fourth-order simulations. The RK scheme has a smaller impact on the code performance, and a classical fourth-order scheme seems to be a good choice in general. The code performs very well on all considered GPUs. The many-CPU performance is also very good, and perfect weak scaling is observed up to many hundreds of CPU cores using MPI. We note that small grid-size simulations are faster on CPUs than on GPUs, while GPUs win significantly over CPUs for simulations involving more than $10^7$ degrees of freedom ($\approx 3100^2$ grid points). When it comes to the environmental impact of numerical simulations, we estimate that GPUs consume less energy than CPUs for large grid-size simulations but more energy on small grids. We observe a tendency that the more modern the GPU, the larger the grid needs to be in order to use it efficiently. This yields a rebound effect because larger simulations need longer computing times and in turn more energy, which is not compensated by the energy efficiency gain of the newer GPUs.
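To make the notion of an "order of convergence" concrete, here is a minimal sketch of the classical fourth-order Runge-Kutta scheme mentioned in the abstract. It is applied to a toy decay equation $du/dt=-u$, which is an assumption chosen for illustration and not one of the equations solved in the paper. The function names (`rk4_step`, `integrate`, `observed_order`) are hypothetical and do not come from the authors' code.

```python
import math


def rk4_step(f, t, u, dt):
    """Advance u by one step of the classical fourth-order Runge-Kutta scheme."""
    k1 = f(t, u)
    k2 = f(t + 0.5 * dt, u + 0.5 * dt * k1)
    k3 = f(t + 0.5 * dt, u + 0.5 * dt * k2)
    k4 = f(t + dt, u + dt * k3)
    return u + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def integrate(f, u0, t_end, n_steps):
    """Integrate du/dt = f(t, u) from t = 0 to t_end with n_steps equal steps."""
    dt = t_end / n_steps
    t, u = 0.0, u0
    for _ in range(n_steps):
        u = rk4_step(f, t, u, dt)
        t += dt
    return u


def decay(t, u):
    # Toy right-hand side (assumption): du/dt = -u, exact solution exp(-t).
    return -u


def observed_order(n_coarse=20):
    """Estimate the convergence order from the error ratio when the step is halved."""
    exact = math.exp(-1.0)
    err_coarse = abs(integrate(decay, 1.0, 1.0, n_coarse) - exact)
    err_fine = abs(integrate(decay, 1.0, 1.0, 2 * n_coarse) - exact)
    return math.log2(err_coarse / err_fine)


if __name__ == "__main__":
    print(f"observed order: {observed_order():.2f}")
```

Halving the time step should reduce the global error by a factor of about $2^4$ for a fourth-order scheme, so the estimate should come out close to four. The same halving test, applied to the cell size instead of the time step, is the usual way to check a spatial order of convergence.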

Paper Structure

This paper contains 10 sections, 22 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: Sketch of a linear (first-order) polynomial representation of a sinusoidal $U(x,t)$ profile. Discontinuities can be seen between cells. The $-$ and $+$ signs denote the flux values of the left and right polynomials at the interface between two cells.
  • Figure 2: Initial condition in the case of $N_k=40$.
  • Figure 3: Convergence tests for the 1d advection equation with an initial condition using $N_k=40$ for different orders. All simulations use a sixth-order Runge-Kutta scheme. The straight solid lines indicate the scaling expected for large cell numbers.
  • Figure 4: Required number of degrees of freedom to achieve a given error as a function of the spatial order of the scheme for 1d advection for $N_k=40$. The dashed functions are of the form $c\,(1/\mathrm{error})^{1/p}$ from Kreiss-Oliger, where the constant $c=200$ is the same for all graphs.
  • Figure 5: Convergence tests for the 1d advection equation with an initial condition using $N_k=40$. The spatial order of the scheme is six for all runs and the order of the temporal Runge-Kutta scheme is varied.
  • ...and 12 more figures
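
The estimate quoted in the Figure 4 caption, $c\,(1/\mathrm{error})^{1/p}$ with $c=200$, can be evaluated directly to see how quickly the required number of degrees of freedom drops with the order. The sketch below assumes that $p$ denotes the spatial order of the scheme. The function name `required_dof` and the sample error target are illustrative choices, not values taken from the paper.

```python
def required_dof(error, order, c=200.0):
    """Evaluate the estimate c * (1/error)**(1/order) from the Figure 4 caption.

    `order` is assumed to be the spatial order p; c = 200 is the constant
    quoted in the caption.
    """
    if error <= 0.0 or order <= 0:
        raise ValueError("error and order must be positive")
    return c * (1.0 / error) ** (1.0 / order)


if __name__ == "__main__":
    target_error = 1e-8  # illustrative target, not a value from the paper
    for p in (2, 3, 4, 6, 8):
        dof = required_dof(target_error, p)
        print(f"order {p}: about {dof:.0f} degrees of freedom")
```

Because the exponent is $1/p$, each increase in order shrinks the estimate, which matches the abstract's statement that high-order runs reach a given error at lower cost.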