An implementation of tensor product patch smoothers on GPU
Cu Cui, Paul Grosse-Bley, Guido Kanschat, Robert Strzodka
TL;DR
This work addresses the efficient solution of Poisson-type problems discretized with high-order tensor-product finite elements on GPUs. It combines a geometric multigrid V-cycle with a tensor-product vertex-patch smoother, matrix-free operator evaluation, and a fast diagonalization local solver to minimize global memory traffic and maximize on-chip data reuse, achieving up to 36% of peak floating-point performance on an Nvidia A100. Key contributions include a detailed GPU implementation that parallelizes over patches by coloring, multiple kernel variants (Global, Separate, Fused), an analysis of memory-bank conflicts and on-chip bandwidth, and a mixed-precision strategy that significantly speeds up the solve without sacrificing accuracy. The results demonstrate speedups of 2x–7x, depending on dimension and polynomial order, and highlight the practical viability of high-order, tensor-product multigrid on modern GPUs for large-scale problems with hundreds of millions of DoFs.
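To make the fast diagonalization idea concrete, here is a minimal sketch in plain Python. It is not the authors' kernel: it assumes a simplified 2D model problem in which the 1D operator is the finite-difference matrix K = tridiag(-1, 2, -1) with an identity mass matrix, so the 1D eigenvectors are known sine functions. For finite element patches the 1D eigenpairs would instead come from a generalized eigenproblem with the element stiffness and mass matrices. All function names (`sine_eigenbasis`, `fast_diag_solve`, `apply_stencil`) are hypothetical.

```python
import math

def sine_eigenbasis(n):
    """Eigenvalues and orthonormal eigenvectors of K = tridiag(-1, 2, -1), size n.

    Assumption: finite-difference model operator with identity mass matrix.
    """
    lam = [2.0 - 2.0 * math.cos(math.pi * (k + 1) / (n + 1)) for k in range(n)]
    scale = math.sqrt(2.0 / (n + 1))
    S = [[scale * math.sin(math.pi * (j + 1) * (k + 1) / (n + 1))
          for k in range(n)] for j in range(n)]
    return lam, S

def matmul(A, B):
    """Dense matrix product of two lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def fast_diag_solve(F):
    """Solve K U + U K = F, i.e. (K (x) I + I (x) K) u = f, for an n x n array F."""
    n = len(F)
    lam, S = sine_eigenbasis(n)
    St = transpose(S)
    # 1) Transform the right-hand side into the 1D eigenbases.
    G = matmul(matmul(St, F), S)
    # 2) The operator is diagonal there: divide by the sums of 1D eigenvalues.
    G = [[G[i][j] / (lam[i] + lam[j]) for j in range(n)] for i in range(n)]
    # 3) Transform back.
    return matmul(matmul(S, G), St)

def apply_stencil(U):
    """Apply the 5-point Laplacian with zero boundary values (used as a check)."""
    n = len(U)
    def at(i, j):
        return U[i][j] if 0 <= i < n and 0 <= j < n else 0.0
    return [[4.0 * U[i][j] - at(i - 1, j) - at(i + 1, j)
             - at(i, j - 1) - at(i, j + 1)
             for j in range(n)] for i in range(n)]

n = 4
F = [[float(i + 2 * j + 1) for j in range(n)] for i in range(n)]
U = fast_diag_solve(F)
R = apply_stencil(U)
residual = max(abs(R[i][j] - F[i][j]) for i in range(n) for j in range(n))
print("max residual:", residual)
```

The point of the sketch is the cost structure: the local solve needs only small dense products with n x n matrices per direction rather than a factorization of the full n^2 x n^2 patch matrix, which is what makes it attractive for on-chip execution.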
Abstract
We present a GPU implementation of vertex-patch smoothers for higher-order finite element methods in two and three dimensions. Analysis shows that they are not memory-bound with respect to GPU DRAM, but with respect to on-chip scratchpad memory. Multigrid operations are optimized through localization and reorganized local operations in on-chip memory, achieving minimal global data transfer and a conflict-free memory access pattern. Performance tests demonstrate that the optimized kernel is at least 2 times faster than the straightforward implementation for the Poisson problem, across various polynomial degrees in 2D and 3D, achieving up to 36% of the peak performance in both single and double precision on an Nvidia A100 GPU.
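As a rough illustration of what a conflict-free access pattern means for on-chip scratchpad (shared) memory, the following sketch models the bank mapping of an Nvidia GPU: 32 banks, each 4 bytes wide, with consecutive 4-byte words assigned to consecutive banks. It counts how many distinct addresses of one warp fall into the same bank. The padding trick shown is a generic textbook remedy and is only an assumption here; it is not necessarily the reorganization used in the paper. The helper names `conflict_degree` and `column_access` are hypothetical.

```python
from collections import Counter

NUM_BANKS = 32  # shared memory banks, each 4 bytes wide

def conflict_degree(word_addresses):
    """Largest number of distinct 4-byte words that map to the same bank.

    1 means conflict-free; k means the access is serialized into k steps.
    Threads reading the same word are served by a broadcast, so duplicates
    are removed before counting.
    """
    per_bank = Counter(addr % NUM_BANKS for addr in set(word_addresses))
    return max(per_bank.values())

def column_access(row_stride, column=0, warp_size=32):
    """Word addresses when thread t reads element (t, column) of a row-major tile."""
    return [t * row_stride + column for t in range(warp_size)]

print(conflict_degree(column_access(32)))  # 32: every thread hits the same bank
print(conflict_degree(column_access(33)))  # 1: one word of padding per row
```

With a row stride of 32 words, all 32 threads of a warp address the same bank and the access is fully serialized. Padding each row by a single word shifts consecutive rows to consecutive banks and removes the conflict.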
