Table of Contents
Fetching ...

Multilevel Interior Penalty Methods on GPUs

Cu Cui, Guido Kanschat

TL;DR

This work addresses efficiently solving high-order discontinuous Galerkin discretizations of the Poisson problem using matrix-free geometric multigrid on GPUs. It introduces a vertex-patch smoother and fast diagonalization for local inverses, along with a patch-wise integration strategy that preserves tensor-product structure and reduces arithmetic and memory access. The GPU implementation emphasizes data-layout, memory coalescing, bank-conflict avoidance, and mixed-precision computation, plus MPI parallelization across multiple GPUs. Results show up to 39% of the Nvidia ${A100}$ peak flop rate and substantial speedups (up to 90%) from mixed precision, with strong and weak scalability demonstrated across multi-GPU configurations. The approach offers practical guidelines for robust, scalable high-order DG solvers on modern GPU architectures.

Abstract

We present a matrix-free multigrid method for high-order discontinuous Galerkin (DG) finite element methods with GPU acceleration. A performance analysis is conducted, comparing various data and compute layouts. Smoother implementations are optimized through localization and fast diagonalization techniques. Leveraging conflict-free access patterns in shared memory, arithmetic throughput of up to 39% of the peak performance on Nvidia A100 GPUs are achieved. Experimental results affirm the effectiveness of mixed-precision approaches and MPI parallelization in accelerating algorithms. Furthermore, an assessment of solver efficiency and robustness is provided across both two and three dimensions, with applications to Poisson problems.

Multilevel Interior Penalty Methods on GPUs

TL;DR

This work addresses efficiently solving high-order discontinuous Galerkin discretizations of the Poisson problem using matrix-free geometric multigrid on GPUs. It introduces a vertex-patch smoother and fast diagonalization for local inverses, along with a patch-wise integration strategy that preserves tensor-product structure and reduces arithmetic and memory access. The GPU implementation emphasizes data-layout, memory coalescing, bank-conflict avoidance, and mixed-precision computation, plus MPI parallelization across multiple GPUs. Results show up to 39% of the Nvidia peak flop rate and substantial speedups (up to 90%) from mixed precision, with strong and weak scalability demonstrated across multi-GPU configurations. The approach offers practical guidelines for robust, scalable high-order DG solvers on modern GPU architectures.

Abstract

We present a matrix-free multigrid method for high-order discontinuous Galerkin (DG) finite element methods with GPU acceleration. A performance analysis is conducted, comparing various data and compute layouts. Smoother implementations are optimized through localization and fast diagonalization techniques. Leveraging conflict-free access patterns in shared memory, arithmetic throughput of up to 39% of the peak performance on Nvidia A100 GPUs are achieved. Experimental results affirm the effectiveness of mixed-precision approaches and MPI parallelization in accelerating algorithms. Furthermore, an assessment of solver efficiency and robustness is provided across both two and three dimensions, with applications to Poisson problems.
Paper Structure (18 sections, 19 equations, 17 figures, 4 tables, 1 algorithm)

This paper contains 18 sections, 19 equations, 17 figures, 4 tables, 1 algorithm.

Figures (17)

  • Figure 1: Compute pattern for patch-wise integrals in 2D. Orange indicates read/write access by the current patch, yellow read only. Gary indicates a contribution of marked faces to orange cell(s). From left to right: center patch, top boundary patch, right boundary patch and corner patch.
  • Figure 2: Left: domain of dependence $\overline V_j$ (brown) and the support of $V_j$ (blue) of the vertex patch solver for the discontinuous Galerkin method. Right: the same for the continuous method where fading indicates that degrees on the boundary of the patch are not part of $V_j$.
  • Figure 3: Data access pattern for the Dirichlet and clamped kernels, respectively, from left to right with polynomial degree $k = 4$. Disks indicate read/write access corresponding to the space $V_j$, circles read only, corresponding to $\overline V_j/V_j$.
  • Figure 4: Non-overlapping coloring for vertex patches on regular meshes. In each color we strive to obtain a tiling of the entire domain. Subsequent colors are obtained by shifting the first patch by one cell in each coordinate direction.
  • Figure 5: Comparison of performance of one smoothing step (left) and of the GMRES solver (right) for different kernels with 134--453 million DoFs in 3D.
  • ...and 12 more figures