Exploiting mesh structure to improve multigrid performance for saddle point problems
Lukas Spies, Luke Olson, Scott MacLachlan
TL;DR
The paper addresses the efficient solution of saddle-point systems arising from finite-element discretizations of incompressible flow, focusing on the Stokes equations discretized with a $Q_2$-$Q_1$ Taylor–Hood scheme. It compares a block-triangular preconditioner with monolithic multigrid using Braess-Sarazin, Vanka, and Schur-Uzawa relaxation, and demonstrates that exploiting structured-mesh data and a careful GPU implementation can make monolithic multigrid with Vanka or Braess-Sarazin relaxation more efficient than the block-factorization approach. Through a kernel-level performance study on GPU and CPU hardware, it shows that Vanka relaxation is particularly effective on GPUs when memory movement is optimized via patch sharing and shared memory, outperforming Braess-Sarazin by more than 20% in some GPU runs. The work provides a roadmap for implementing high-performance monolithic multigrid solvers on modern architectures and highlights the practical importance of optimizing memory-bound kernels for saddle-point problems in scientific computing.
Abstract
In recent years, solvers for finite-element discretizations of linear or linearized saddle-point problems, like the Stokes and Oseen equations, have become well established. There are two main classes of preconditioners for such systems: those based on a block-factorization approach and those based on monolithic multigrid. Both classes of preconditioners have several critical choices to be made in their composition, such as the selection of a suitable relaxation scheme for monolithic multigrid. From existing studies, some insight can be gained as to what options are preferable in low-performance computing settings, but there are very few fair comparisons of these approaches in the literature, particularly for modern architectures, such as GPUs. In this paper, we perform a comparison between a block-triangular preconditioner and a monolithic multigrid method with the three most common choices of relaxation scheme: Braess-Sarazin, Vanka, and Schur-Uzawa. We develop a performant Vanka relaxation algorithm for structured-grid discretizations, which takes advantage of memory efficiencies in this setting. We detail the behavior of the various CUDA kernels for the multigrid relaxation schemes and evaluate their individual arithmetic intensity, performance, and runtime. Running a preconditioned FGMRES solver for the Stokes equations with these preconditioners allows us to compare their efficiency in a practical setting. We show that monolithic multigrid can outperform block-triangular preconditioning, and that using Vanka or Braess-Sarazin relaxation is most efficient. Even though multigrid with Vanka relaxation exhibits reduced performance on the CPU (up to $100\%$ slower than Braess-Sarazin), it is able to outperform Braess-Sarazin by more than $20\%$ on the GPU, making it a competitive algorithm, especially given the large amount of algorithmic tuning needed for effective Braess-Sarazin relaxation.
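For orientation, the saddle-point structure referred to above can be sketched in its standard textbook form. The symbols $A$, $B$, $u$, $p$, $f$, $\hat{S}$, and $P$ are notation assumed here for illustration and may differ from the conventions used in the body of the paper. A stable mixed discretization of the Stokes equations typically leads to a linear system of the form

$$
\begin{pmatrix} A & B^T \\ B & 0 \end{pmatrix}
\begin{pmatrix} u \\ p \end{pmatrix}
=
\begin{pmatrix} f \\ 0 \end{pmatrix},
$$

where $A$ is the discrete vector Laplacian acting on the velocity unknowns $u$, and $B$ is the discrete divergence coupling velocity to the pressure unknowns $p$. One common block-triangular preconditioner for such a system is

$$
P = \begin{pmatrix} A & B^T \\ 0 & -\hat{S} \end{pmatrix},
\qquad \hat{S} \approx B A^{-1} B^T,
$$

with $\hat{S}$ an approximation to the Schur complement (for Stokes, often a pressure mass matrix), and with the action of $A^{-1}$ usually replaced by an approximate solve such as a multigrid cycle. Monolithic multigrid, by contrast, applies a multigrid cycle to the full coupled system, which is why it requires relaxation schemes, such as Braess-Sarazin, Vanka, or Schur-Uzawa, that treat velocity and pressure together.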
