Exploiting mesh structure to improve multigrid performance for saddle point problems

Lukas Spies; Luke Olson; Scott MacLachlan

TL;DR

The paper addresses efficient solution of saddle-point systems from finite-element discretizations of incompressible flow, focusing on Stokes equations discretized with a $Q_2$-$Q_1$ Taylor–Hood scheme. It compares block-triangular preconditioners with monolithic multigrid, examining Braess-Sarazin, Vanka, and Schur-Uzawa relaxation, and demonstrates that exploiting structured mesh data and careful GPU implementation can make monolithic multigrid with Vanka or Braess-Sarazin superior to block-factorization approaches. Through a detailed kernel-level performance study on GPU and CPU hardware, it shows that Vanka relaxation is particularly effective on GPUs when memory movement is optimized via patch sharing and shared memory, outperforming Braess-Sarazin by over 20% in some GPU runs. The work provides a roadmap for implementing high-performance monolithic multigrid solvers on modern architectures and highlights the practical importance of memory-bound optimization for saddle-point problems in scientific computing.

Abstract

In recent years, solvers for finite-element discretizations of linear or linearized saddle-point problems, like the Stokes and Oseen equations, have become well established. There are two main classes of preconditioners for such systems: those based on a block-factorization approach and those based on monolithic multigrid. Both classes of preconditioners have several critical choices to be made in their composition, such as the selection of a suitable relaxation scheme for monolithic multigrid. From existing studies, some insight can be gained as to what options are preferable in low-performance computing settings, but there are very few fair comparisons of these approaches in the literature, particularly for modern architectures, such as GPUs. In this paper, we perform a comparison between a block-triangular preconditioner and a monolithic multigrid method with the three most common choices of relaxation scheme: Braess-Sarazin, Vanka, and Schur-Uzawa. We develop a performant Vanka relaxation algorithm for structured-grid discretizations, which takes advantage of memory efficiencies in this setting. We detail the behavior of the various CUDA kernels for the multigrid relaxation schemes and evaluate their individual arithmetic intensity, performance, and runtime. Running a preconditioned FGMRES solver for the Stokes equations with these preconditioners allows us to compare their efficiency in a practical setting. We show that monolithic multigrid can outperform block-triangular preconditioning, and that using Vanka or Braess-Sarazin relaxation is most efficient. Even though multigrid with Vanka relaxation exhibits reduced performance on the CPU (up to $100\%$ slower than Braess-Sarazin), it is able to outperform Braess-Sarazin by more than $20\%$ on the GPU, making it a competitive algorithm, especially given the high amount of algorithmic tuning needed for effective Braess-Sarazin relaxation.
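
To make the idea of Vanka relaxation concrete, the following is a minimal sketch of one multiplicative patch sweep on a small dense saddle-point system. It is an illustration under stated assumptions, not the paper's structured-grid CUDA implementation: the matrix is random toy data rather than a $Q_2$-$Q_1$ discretization, the two patches are chosen by hand, and the function name `vanka_sweep` is hypothetical.

```python
import numpy as np

def vanka_sweep(K, b, x, patches):
    """One multiplicative Vanka-style sweep (illustrative sketch).

    For each patch, the local saddle-point subproblem is solved exactly
    and the patch unknowns are corrected before moving to the next patch.
    """
    x = x.copy()
    for patch in patches:
        r = b - K @ x                      # current global residual
        K_loc = K[np.ix_(patch, patch)]    # dense patch submatrix
        x[patch] += np.linalg.solve(K_loc, r[patch])
    return x

# Toy saddle-point system K = [[A, B^T], [B, 0]].
# Assumption: random data, not an actual Stokes discretization.
rng = np.random.default_rng(0)
n_u, n_p = 6, 2                            # velocity and pressure unknowns
M = rng.standard_normal((n_u, n_u))
A = M @ M.T + n_u * np.eye(n_u)            # symmetric positive definite block
B = rng.standard_normal((n_p, n_u))
K = np.block([[A, B.T], [B, np.zeros((n_p, n_p))]])
b = rng.standard_normal(n_u + n_p)

# Assumption: one overlapping patch per pressure unknown, each holding that
# pressure and a hand-picked set of velocity unknowns.
patches = [np.array([0, 1, 2, 3, n_u + 0]),
           np.array([2, 3, 4, 5, n_u + 1])]

x = vanka_sweep(K, b, np.zeros(n_u + n_p), patches)

# After the sweep, the residual vanishes on the most recently solved patch.
last = patches[-1]
patch_residual = np.linalg.norm((b - K @ x)[last])
```

Each patch solve drives the residual to zero on that patch's unknowns, while earlier patches that overlap it are perturbed again. The sketch recomputes the full residual for every patch, which is wasteful; the memory-traffic cost of gathering patch data is the kind of overhead the paper targets with its structured-grid approach.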

Paper Structure (24 sections, 22 equations, 14 figures, 3 tables, 7 algorithms)

Introduction
The Stokes equations and their discretization
Problem setup
Discretization
Structured matrix representation
Multigrid
Braess-Sarazin relaxation scheme
Vanka relaxation scheme
Schur-Uzawa relaxation scheme
Block-Triangular preconditioner
Our implementation
Existing work
Performance Analysis
Test System
Kernels
...and 9 more sections

Figures (14)

Figure 1: Visualization of three components of the manufactured solution.
Figure 2: Illustration of the degrees of freedom for a Q2 and Q1 element, with different types of degrees of freedom identified by different shapes.
Figure 3: Local numbering of degrees of freedom around nodal degree of freedom $5$.
Figure 4: Illustration of overlapping $2\times 2$ Vanka patches.
Figure 5: Braess-Sarazin: Kernels and their proportion of runtime.
...and 9 more figures
