An implementation of tensor product patch smoothers on GPU

Cu Cui, Paul Grosse-Bley, Guido Kanschat, Robert Strzodka

TL;DR

This work addresses the efficient solution of Poisson-type problems discretized with high-order tensor-product finite elements on GPUs. It combines a geometric multigrid V-cycle with a tensor-product vertex-patch smoother, matrix-free operator evaluation, and a local solver based on fast diagonalization to minimize global memory traffic and maximize on-chip data reuse, reaching up to 36% of peak floating-point performance on an Nvidia A100. Key contributions include a detailed GPU implementation with colored patch parallelism, multiple kernel variants (Global, Separate, Fused), an analysis of memory bank conflicts and on-chip bandwidth, and a mixed-precision strategy that significantly speeds up the solve without sacrificing accuracy. The results show maximum speedups between 2x and 7x, depending on dimension and polynomial degree, and highlight the practical viability of high-order tensor-product multigrid on modern GPUs for large-scale problems with hundreds of millions of DoFs.

Abstract

We present a GPU implementation of vertex-patch smoothers for higher-order finite element methods in two and three dimensions. Analysis shows that they are not memory bound with respect to GPU DRAM, but with respect to on-chip scratchpad memory. Multigrid operations are optimized through localization and reorganized local operations in on-chip memory, achieving minimal global data transfer and a conflict-free memory access pattern. Performance tests demonstrate that the optimized kernel is at least 2 times faster than the straightforward implementation for the Poisson problem, across various polynomial degrees in 2D and 3D, achieving up to 36% of the peak performance in both single and double precision on an Nvidia A100 GPU.
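
The summary above names fast diagonalization as the local solver. The sketch below illustrates the general idea on a deliberately simplified model problem and is not the authors' GPU kernel. It assumes a 2D finite-difference Laplacian A = K⊗I + I⊗K with identity mass matrix, so the 1D eigenvectors are known in closed form; a finite element patch would instead need a generalized eigenproblem with the 1D mass matrix. All function names are hypothetical.

```python
import math

def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def laplace_1d(n):
    """1D model matrix K = tridiag(-1, 2, -1) (assumption: unit mesh size)."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]

def eigen_1d(n):
    """Closed-form orthonormal eigenvectors (columns) and eigenvalues of K."""
    h = math.pi / (n + 1)
    scale = math.sqrt(2.0 / (n + 1))
    vectors = [[scale * math.sin((i + 1) * (j + 1) * h) for j in range(n)]
               for i in range(n)]
    values = [2.0 - 2.0 * math.cos((j + 1) * h) for j in range(n)]
    return vectors, values

def apply_operator(u):
    """Apply A = K (x) I + I (x) K to an n-by-n array: A(U) = K U + U K."""
    n = len(u)
    k = laplace_1d(n)
    ku, uk = matmul(k, u), matmul(u, k)
    return [[ku[i][j] + uk[i][j] for j in range(n)] for i in range(n)]

def fast_diagonalization_solve(f):
    """Solve K U + U K = F using only 1D transforms and a diagonal scaling."""
    n = len(f)
    s, lam = eigen_1d(n)
    st = transpose(s)
    g = matmul(matmul(st, f), s)            # transform to the eigenbasis
    g = [[g[i][j] / (lam[i] + lam[j]) for j in range(n)] for i in range(n)]
    return matmul(matmul(s, g), st)         # transform back

f = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0], [4.0, 1.5, 2.5]]
u = fast_diagonalization_solve(f)
residual = max(abs(r - b) for row_r, row_b in zip(apply_operator(u), f)
               for r, b in zip(row_r, row_b))
```

The point of the technique is that the inverse is never formed as a dense matrix of size n² by n²: only the small 1D matrices and a pointwise division are needed, which is what makes it attractive for on-chip memory.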

Paper Structure (20 sections, 23 equations, 20 figures, 7 tables, 4 algorithms)

Figures (20)

  • Figure 1: Degrees of freedom including the boundary of the vertex patch, defining $\overline V_j$, and used for the local computations of residuals (left). Interior degrees of freedom involved in the local solver, defining $V_j$ (right).
  • Figure 1: Comparison of implementation of local operation on throughput of one smoothing step in two and three dimensions. Inverse matrix: apply the local solver through multiplication with an inverse matrix. Fast diagonalization: apply the local solver through fast diagonalization.
  • Figure 1: Breakdown of components and throughput of the preconditioned GMRES solver with a vertex-patch smoother in 2D and 3D.
  • Figure 2: Non-overlapping coloring for vertex patches on regular meshes. In each color we strive to obtain a parqueting of the whole domain. Subsequent colors are obtained by shifting the first patch by one cell in each coordinate direction.
  • Figure 2: Visualization of degrees of freedom layout for $\mathbb{Q}_2$ element in 2D, using global lexicographical numbering.
  • ...and 15 more figures
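
The non-overlapping coloring described in the caption of Figure 2 can be illustrated with a small sketch. It assumes a Cartesian mesh with the same number of cells in each direction and groups interior vertices by the parity of their indices, which gives 2^d colors whose patches share no cell. This is one simple way to realize such a coloring, not necessarily the scheme used in the paper, and the function names are hypothetical.

```python
from itertools import product

def patch_cells(vertex):
    """Cells (identified by their lowest corner) that touch an interior vertex."""
    return {tuple(v - 1 + s for v, s in zip(vertex, shift))
            for shift in product((0, 1), repeat=len(vertex))}

def color_patches(n_cells, dim):
    """Group interior vertices of an n_cells^dim mesh by index parity."""
    colors = {}
    for vertex in product(range(1, n_cells), repeat=dim):
        key = tuple(v % 2 for v in vertex)   # shifting by one cell flips a parity
        colors.setdefault(key, []).append(vertex)
    return colors

def is_cell_disjoint(vertices):
    """Check that no two patches in the list share a cell."""
    seen = set()
    for vertex in vertices:
        cells = patch_cells(vertex)
        if seen & cells:
            return False
        seen |= cells
    return True

colors_2d = color_patches(n_cells=6, dim=2)
colors_3d = color_patches(n_cells=4, dim=3)
all_disjoint = all(is_cell_disjoint(v) for v in colors_2d.values())
```

Patches within one color can then be processed in parallel, while the colors themselves are visited one after another.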