Table of Contents
Fetching ...

A GPU-Accelerated Matrix-Free FAS Multigrid Solver for Navier-Stokes Equations with Memory-Efficient Implementations

Jiale Meng, Shuqi Tang, Steven M. Wise, Zhenlin Guo

TL;DR

The paper introduces a memory-efficient, GPU-accelerated matrix-free Full Approximation Storage (FAS) multigrid solver for Navier–Stokes and multiphase problems on staggered grids, implemented in MATLAB. A novel X-shape Multi-Color Gauss–Seidel (X-MCGS) smoother eliminates branching and enhances GPU parallelism, while inter-grid transfer and smoothing components are fully GPU-accelerated. Memory reuse strategies reduce GPU resident variables by up to ~60% (e.g., from 12–15 to 8), enabling large 2D and 3D simulations (up to $512^3$) on a single RTX GPU, with substantial speedups over CPU baselines (e.g., $61\times$ in 2D at $8192^2$, $46\times$ in 3D at $512^3$). The framework is validated on grain growth, vesicle phase separation, and air–water two-bubble coalescence, demonstrating both accuracy and scalability, and lays groundwork for future multi-GPU extensions.

Abstract

We develop a matrix-free Full Approximation Storage (FAS) multigrid solver based on staggered finite differences and implemented on GPU in MATLAB. To enhance performance, intermediate variables are reused, and an X-shape Multi-Color Gauss-Seidel (X-MCGS) smoother is introduced, which eliminates conditional branching by partitioning the grid into four submatrices. Restriction and prolongation operators are also GPU-accelerated. Convergence tests verify robustness and accuracy, while benchmarks show substantial speedups: for the 2D heat equation on an $8192^2$ grid, the RTX~4090 achieves $61\times$ over a single-core CPU, and in 3D at $512^3$, $46\times$. A memory-efficient implementation of first- and second-order projection schemes reduces GPU-resident variables from 12/15 to 8, lowering memory footprint and improving performance by 20--30%, enabling $512^3$ Navier-Stokes simulations on a single GPU. Grain growth on a $512^2$ grid accommodates up to $q=1189$ (2D) and $q=123$ (3D) orientations, reproducing expected scaling laws. Coupled with Cahn-Hilliard equations, air-water two-bubble coalescence is simulated on a $256\times 256\times 1024$ grid, agreeing with experimental observations.

A GPU-Accelerated Matrix-Free FAS Multigrid Solver for Navier-Stokes Equations with Memory-Efficient Implementations

TL;DR

The paper introduces a memory-efficient, GPU-accelerated matrix-free Full Approximation Storage (FAS) multigrid solver for Navier–Stokes and multiphase problems on staggered grids, implemented in MATLAB. A novel X-shape Multi-Color Gauss–Seidel (X-MCGS) smoother eliminates branching and enhances GPU parallelism, while inter-grid transfer and smoothing components are fully GPU-accelerated. Memory reuse strategies reduce GPU resident variables by up to ~60% (e.g., from 12–15 to 8), enabling large 2D and 3D simulations (up to ) on a single RTX GPU, with substantial speedups over CPU baselines (e.g., in 2D at , in 3D at ). The framework is validated on grain growth, vesicle phase separation, and air–water two-bubble coalescence, demonstrating both accuracy and scalability, and lays groundwork for future multi-GPU extensions.

Abstract

We develop a matrix-free Full Approximation Storage (FAS) multigrid solver based on staggered finite differences and implemented on GPU in MATLAB. To enhance performance, intermediate variables are reused, and an X-shape Multi-Color Gauss-Seidel (X-MCGS) smoother is introduced, which eliminates conditional branching by partitioning the grid into four submatrices. Restriction and prolongation operators are also GPU-accelerated. Convergence tests verify robustness and accuracy, while benchmarks show substantial speedups: for the 2D heat equation on an grid, the RTX~4090 achieves over a single-core CPU, and in 3D at , . A memory-efficient implementation of first- and second-order projection schemes reduces GPU-resident variables from 12/15 to 8, lowering memory footprint and improving performance by 20--30%, enabling Navier-Stokes simulations on a single GPU. Grain growth on a grid accommodates up to (2D) and (3D) orientations, reproducing expected scaling laws. Coupled with Cahn-Hilliard equations, air-water two-bubble coalescence is simulated on a grid, agreeing with experimental observations.

Paper Structure

This paper contains 28 sections, 23 equations, 20 figures, 8 tables, 1 algorithm.

Figures (20)

  • Figure 1: Illustration of the fine grid $\Omega_1$ in the two-grid framework, which can be further generalized to multilevel grid hierarchies. The figure presents a 2D uniform staggered discretization with cell-centered and edge-centered variables. The left panel depicts the global grid configuration, whereas the right panel provides a magnified view of a representative control volume with mesh size $h_1$. The symbols denote the types of variables as specified in the legend. See § \ref{['sec2.1']} for details.
  • Figure 2: Illustration of three MCGS partitioning patterns used for smoothing on 2D uniform grids: (left) X-shape, (middle) U-shape, and (right) Z-shape. The numbers (1–4) indicate the sub-block update order within each scheme, and the arrows show the sweep directions.
  • Figure 3: Convergence performance of six MCGS smoothing operators within the multigrid V-cycle in 2D. The experiment uses a $2048^2$ grid, with other details given in Section \ref{['sec3.1.2']}. Left: Residual histories for the X-shape, U-shape, and Z-shape partitioning patterns under two ordering sequences (1234–1234 and 1234–4321). The iterations proceed until $\text{res}<10^{-9}$. Right: Total number of multigrid iterations required by each configuration to reach the stopping criterion. The results show that the X-shape scheme with the 1234–1234 ordering consistently achieves the fastest convergence, requiring only about half as many iterations as the slowest U-shape schemes and roughly one-third as many as the slowest Z-shape schemes. See § \ref{['sec2.3']} for details.
  • Figure 4: Illustration of the X-MCGS partitioning in 2D. Left: The computational grid is decomposed into four disjoint sets (1--4) based on the parity of the row and column indices. Right: Sequential update order: (i) set 1, (ii) set 2, (iii) set 3, and (iv) set 4. White cells correspond to the values $*^{k,m-1}$ from the preceding smoothing step, whereas gray cells denote the new values $*^{k,m}$ updated from them during the current $m$-th smoothing of the $k$-th V-cycle. This decomposition enables vectorized sub-block updates, maximizes GPU parallelism, and eliminates conditional branching during parallel execution. See § \ref{['sec2.3']} for details.
  • Figure 5: Algebraic convergence of the matrix-free FAS multigrid solver in 2D, where the Algebraic error, defined as $\| p_{1,i,j}^{\text{exact}} - p_{1,i,j}^{k,m} \|_{L^2(\Omega_1)}$, and the residual, defined as Equation \ref{['residual']} both show nearly $h$-independent reduction as the iteration count $k$ increases. The legend indicates different grid sizes $N_1^2$ by colored lines. See § \ref{['sec3.1.1']} for details.
  • ...and 15 more figures

Theorems & Definitions (3)

  • Remark 1
  • Remark 2
  • Remark 3