Table of Contents
Fetching ...

A GEMM-based direct solver for finite-difference Poisson problems in non-uniform grids

Pedro Costa, Duarte Palancha, Joshua Romero, Roberto Verzicco, Massimiliano Fatica

TL;DR

A direct Poisson solver for massively parallel simulations on three-dimensional Cartesian grids with non-uniform spacing is presented, enabling efficient high-resolution stretched-mesh simulations on modern heterogeneous systems.

Abstract

We present a direct Poisson solver for massively parallel simulations on three-dimensional Cartesian grids with non-uniform spacing. The method uses a tensor-based formulation in which the operator is diagonalized numerically along two directions through one-dimensional eigendecompositions, while the third direction is solved directly. The resulting dense transforms are evaluated efficiently as GEMMs (General Matrix--Matrix Multiplications), allowing many independent one-dimensional operations to be combined into matrix-matrix products that map well to modern CPU and GPU hardware. For uniform grids, the method reduces to the classical eigenfunction-expansion approach, and it naturally supports hybrid combinations of FFT-based and GEMM-based transforms depending on grid uniformity. After coupling the solver to an incompressible Navier-Stokes code, we assess its accuracy and performance against geometric multigrid and block cyclic reduction with FFT diagonalization. The results show that the proposed method is robust and consistently achieves the best time-to-solution. In strong scaling, the more compute-intensive GEMM-based variants attain higher parallel efficiency by better amortizing communication costs, while weak scaling highlights the expected trade-off between FFT-based and dense-transform formulations. Overall, the method enables efficient high-resolution stretched-mesh simulations on modern heterogeneous systems.

A GEMM-based direct solver for finite-difference Poisson problems in non-uniform grids

TL;DR

A direct Poisson solver for massively parallel simulations on three-dimensional Cartesian grids with non-uniform spacing is presented, enabling efficient high-resolution stretched-mesh simulations on modern heterogeneous systems.

Abstract

We present a direct Poisson solver for massively parallel simulations on three-dimensional Cartesian grids with non-uniform spacing. The method uses a tensor-based formulation in which the operator is diagonalized numerically along two directions through one-dimensional eigendecompositions, while the third direction is solved directly. The resulting dense transforms are evaluated efficiently as GEMMs (General Matrix--Matrix Multiplications), allowing many independent one-dimensional operations to be combined into matrix-matrix products that map well to modern CPU and GPU hardware. For uniform grids, the method reduces to the classical eigenfunction-expansion approach, and it naturally supports hybrid combinations of FFT-based and GEMM-based transforms depending on grid uniformity. After coupling the solver to an incompressible Navier-Stokes code, we assess its accuracy and performance against geometric multigrid and block cyclic reduction with FFT diagonalization. The results show that the proposed method is robust and consistently achieves the best time-to-solution. In strong scaling, the more compute-intensive GEMM-based variants attain higher parallel efficiency by better amortizing communication costs, while weak scaling highlights the expected trade-off between FFT-based and dense-transform formulations. Overall, the method enables efficient high-resolution stretched-mesh simulations on modern heterogeneous systems.
Paper Structure (12 sections, 18 equations, 12 figures, 1 table, 1 algorithm)

This paper contains 12 sections, 18 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: Illustrative sketch of the non-uniform one-dimensional finite-difference grid and spacing definitions. Black circles denote the nodal locations $x_i$ associated with the unknowns $\phi_i$, while $x_{i\pm 1/2}$ indicate the half-index locations. See the main text for the non-uniform grid spacing definitions.
  • Figure 2: Illustration of the domain decomposition and computation/communication operations in the Poisson solver. The physical domain (and coordinate axes) is fixed throughout; only the data distribution among (MPI) tasks changes, as depicted by the colors. The algorithm starts with $x$-aligned pencils (left), then performs collective transpose operations that redistribute the same data to $y$-aligned pencils (middle) and subsequently to $z$-aligned pencils (right). These transposes are required because the forward/backward eigenbasis transforms (GEMM- or FFT-based) along $x$ and $y$ act on lines that must be local to each task. Finally, the resulting tridiagonal systems along $z$ can be solved either locally (TDMA; present work) or using a distributed algorithm such as PCR--TDMA Diez-et-al-CPC-2025.
  • Figure 3: Validation test cases. (a) Lid-driven cavity: velocity profiles along the cavity centerlines for a cubic cavity at $\mathrm{Re}=1000$ (center plane $z=0$) and a square cavity in the Stokes limit ($\mathrm{Re}\approx 0$). Circles correspond to the reference DNS data of Ku-et-al-JCP-1987, and squares to a reference solution from a vorticity--streamfunction finite-difference Stokes solver. The profiles follow their color-matched axes: the blue curves represent the wall-normal velocity component $v$ along $x$, and the black curves the streamwise velocity component $u$ along $y$. (b) Turbulent square duct at $\mathrm{Re}=4410$: mean streamwise velocity sampled along the cross-section diagonal ($s=\sqrt{2}\,y=\sqrt{2}\,z$; $d=2\sqrt{2}\,h$ is the diagonal length), compared to the DNS data of Gavrilakis-JFM-1992 (dashed lines).
  • Figure 4: Single-core CPU performance for the Poisson solve on a $128^3$ cubic grid with the same boundary conditions as the square-duct case. The vertical axis reports the average wall-clock time per Poisson solve. See the main text for the definition of the different cases under the same method. It is worth remarking that the remainder of the Navier--Stokes solver calculations take about $0.079\,\mathrm{s}$ per RK3 stage, leading to a Poisson solver footprint of about $50$--$63\%$ for the present method, $87\%$ for FFT+BLKTRI, and $87$--$99\%$ for the multigrid cases.
  • Figure 5: Breakdown of different operation contributions to the single-core CPU performance for the Poisson solve on a $128^3$ cubic grid for different $x$/$y$ operator combinations (FF/GF/FG/GG): FFTs/GEMMs along $x$ and $y$ (blues/reds; cumulative forward and backward), collective domain transpose operations (greens; cumulative forward and backward), and TDMA step along $z$ (purples). "Other" denotes other minor contributions related to local array copies (light gray).
  • ...and 7 more figures