
GALÆXI: Solving complex compressible flows with high-order discontinuous Galerkin methods on accelerator-based systems

Daniel Kempf, Marius Kurz, Marcel Blind, Patrick Kopper, Philipp Offenhäuser, Anna Schwarz, Spencer Starr, Jens Keim, Andrea Beck

TL;DR

GALÆXI is a potent tool for accurate and efficient simulations of compressible flows in the realm of exascale computing and the associated new HPC architectures.

Abstract

This work presents GALÆXI as a novel, energy-efficient flow solver for the simulation of compressible flows on unstructured meshes that leverages the parallel computing power of modern Graphics Processing Units (GPUs). GALÆXI implements the high-order Discontinuous Galerkin Spectral Element Method (DGSEM) and uses shock capturing with a finite-volume subcell approach to ensure the stability of the high-order scheme near shocks. This work provides details on the general code design, the parallelization strategy, and the implementation approach for the compute kernels, with a focus on the element-local mappings between volume and surface data required by the unstructured mesh. GALÆXI exhibits excellent strong scaling properties up to 1024 GPUs if each GPU is assigned a minimum of one million degrees of freedom. To verify its implementation, a convergence study is performed that recovers the theoretical order of convergence of the implemented numerical schemes. Moreover, the solver is validated using both the incompressible and compressible formulations of the Taylor-Green vortex at Mach numbers of 0.1 and 1.25, respectively. A mesh convergence study shows that the results converge to the high-fidelity reference solution and match the original CPU implementation. Finally, GALÆXI is applied to a large-scale wall-resolved large-eddy simulation of a linear cascade of the NASA Rotor 37. Here, the supersonic region and the shocks at the leading edge are captured accurately and robustly by the implemented shock-capturing approach. It is demonstrated that GALÆXI requires less than half of the energy to carry out this simulation in comparison to the reference CPU implementation. This renders GALÆXI a potent tool for accurate and efficient simulations of compressible flows in the realm of exascale computing and the associated new HPC architectures.
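
The scaling condition of one million degrees of freedom per GPU can be turned into a rough sizing rule. The sketch below assumes hexahedral elements with $(N+1)^3$ interpolation points each and counts degrees of freedom per solution variable; the function names and the example mesh size are hypothetical and not taken from the paper.

```python
import math


def dofs_per_element(n: int) -> int:
    """Interpolation points in one hexahedral element of polynomial degree n."""
    return (n + 1) ** 3


def min_elements_per_gpu(n: int, min_dofs: int = 1_000_000) -> int:
    """Smallest element count that gives a GPU at least min_dofs points."""
    return math.ceil(min_dofs / dofs_per_element(n))


def max_gpus(total_elements: int, n: int, min_dofs: int = 1_000_000) -> int:
    """Largest GPU count that keeps the average load at or above min_dofs."""
    return max(1, (total_elements * dofs_per_element(n)) // min_dofs)


if __name__ == "__main__":
    n = 7  # polynomial degree used in the profiling figure
    print(dofs_per_element(n))         # 512
    print(min_elements_per_gpu(n))     # 1954
    # Hypothetical mesh with two million elements (not a case from the paper).
    print(max_gpus(2_000_000, n))      # 1024
```

Under these assumptions, a mesh needs on the order of two million elements at $N=7$ before 1024 GPUs can each be given one million degrees of freedom.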

Paper Structure (24 sections, 24 equations, 13 figures, 3 tables, 4 algorithms)

Figures (13)

  • Figure 1: Perspective sketch of a single DG element in the reference space using Legendre-Gauss interpolation points with $N=2$. Gray cubes indicate the interpolation points within the element, while the gray squares indicate interpolation points on the six local faces called $\xi^{\pm},\eta^{\pm},\zeta^{\pm}$. The linewise operations of the tensor product ansatz are indicated for the center interpolation point, where the operations along the coordinates $\bm{\xi}=(\xi,\eta,\zeta)$ are highlighted in blue, red and green, respectively.
  • Figure 2: Sketch of the sub-cell shock capturing scheme. The DG polynomial using LGL points and a polynomial degree of $N=3$ is shown in black with the interpolation points indicated as dots and the integral mean solution within the subcells is shown in blue. The solution in the neighboring DG elements is indicated in red.
  • Figure 3: Domain decomposition for a generic airfoil simulation with large spanwise extension. The domain is cut such that the airfoil (transparent surface) including the boundary layer part is visible. Patches of different colors represent individual MPI domains that are processed by different ranks. This figure is an example of a fine granularity, e.g. in the CPU case. In the GPU case, larger MPI domains occur.
  • Figure 4: Flowchart of GALÆXI for a single evaluation of the convective DG operator using streams. Some routines comprise several individual compute kernels instead of single, monolithic device kernels. These are summarized here to keep the flowchart concise. Moreover, the lifting procedure to compute the gradients is omitted here for readability.
  • Figure 5: Portion of compute time in percent for individual routines on HAWK-AI with $N=7$, split-form DG and $8.9\times 10^{5}$ DOF on a single GPU. Routines associated with the computation of the gradients via the lifting method are prefixed with "Lift_". Various small routines associated with performing the actual time integration, i.e. updating $\bm{U}$, are summarized under Misc.
  • ...and 8 more figures
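
The linewise operations of the tensor-product ansatz described for Figure 1 can be illustrated with a small CPU-side sketch. It is not GALÆXI's GPU kernel: it only shows that a single 1D differentiation matrix, applied separately along $\xi$, $\eta$ and $\zeta$, yields all three reference-space derivatives at every interpolation point. The Legendre-Gauss nodes for $N=2$ are standard; the function names `diff_matrix` and `linewise_gradient` are hypothetical.

```python
import math

# Legendre-Gauss nodes for N = 2 (three points per direction).
GAUSS_N2 = [-math.sqrt(0.6), 0.0, math.sqrt(0.6)]


def diff_matrix(x):
    """1D Lagrange differentiation matrix built from barycentric weights."""
    n = len(x)
    w = [1.0 / math.prod(x[j] - x[k] for k in range(n) if k != j)
         for j in range(n)]
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d[i][j] = (w[j] / w[i]) / (x[i] - x[j])
        # Rows sum to zero because the derivative of a constant vanishes.
        d[i][i] = -sum(d[i][j] for j in range(n) if j != i)
    return d


def linewise_gradient(u, d):
    """Apply d along each index of u[i][j][k] (xi, eta, zeta lines)."""
    r = range(len(d))
    u_xi = [[[sum(d[i][l] * u[l][j][k] for l in r) for k in r]
             for j in r] for i in r]
    u_eta = [[[sum(d[j][l] * u[i][l][k] for l in r) for k in r]
              for j in r] for i in r]
    u_zeta = [[[sum(d[k][l] * u[i][j][l] for l in r) for k in r]
               for j in r] for i in r]
    return u_xi, u_eta, u_zeta


if __name__ == "__main__":
    x = GAUSS_N2
    d = diff_matrix(x)
    # Test field u = xi^2 + 2*eta + xi*zeta, exactly representable for N = 2.
    u = [[[xi * xi + 2.0 * eta + xi * zeta for zeta in x]
          for eta in x] for xi in x]
    u_xi, u_eta, u_zeta = linewise_gradient(u, d)
    # Value at the center point (xi = eta = zeta = 0); exact result is (0, 2, 0).
    print(u_xi[1][1][1], u_eta[1][1][1], u_zeta[1][1][1])
```

Each derivative at a point touches only the $N+1$ values on one line through that point, which is the access pattern the colored lines in the figure indicate.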