Scalable s-step Preconditioned Conjugate Gradient with Chebyshev Basis and Gauss-Seidel Gram Solve

Pasqua D'Ambra; Massimo Bernaschi; Mauro G. Carrozzo; Stephen Thomas

TL;DR

Large-scale experiments on modern NVIDIA GPU architectures demonstrate that the proposed Chebyshev-stabilized, Gauss-Seidel-enhanced s-step PCG achieves convergence comparable to classical CG while reducing synchronization overhead, making it a stable and scalable alternative for current and next-generation accelerator systems.

Abstract

We present a variant of the s-step Preconditioned Conjugate Gradient (PCG) method that combines a Chebyshev-stabilized Krylov basis with a Forward Gauss-Seidel (FGS) iteration for the solution of the reduced Gram systems. In s-step Conjugate Gradient, multiple search directions are generated per outer iteration, reducing global synchronization costs but requiring the solution of small dense Gram systems whose conditioning is critical for stability. We analyze the structure of the Chebyshev Gram matrix and show that its moment-based representation is associated with favorable conditioning properties for moderate step sizes. Building on inexact Krylov theory and on the classical equivalence between FGS and Modified Gram-Schmidt (MGS), we provide a structural analysis and theoretical rationale supporting the use of a small number of FGS sweeps, while preserving the convergence behavior observed in practical regimes. Large-scale experiments on modern NVIDIA GPU architectures demonstrate that the proposed Chebyshev-stabilized, Gauss-Seidel-enhanced s-step PCG achieves convergence comparable to classical CG while reducing synchronization overhead, making it a stable and scalable alternative for current and next-generation accelerator systems.
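The role of the basis in the conditioning of the Gram system can be illustrated with a small sketch. It is not the authors' implementation. It assumes an unpreconditioned 1D Laplacian as the test operator, the Euclidean inner product for the Gram matrix, and the spectral bounds 0 and 4. The helper names `chebyshev_basis` and `monomial_basis` are hypothetical.

```python
import numpy as np

def chebyshev_basis(A, v, s, lam_min, lam_max):
    """Return columns T_0(Z)v, ..., T_{s-1}(Z)v, where Z is A shifted and
    scaled so that [lam_min, lam_max] maps to [-1, 1]."""
    center = 0.5 * (lam_max + lam_min)
    half_width = 0.5 * (lam_max - lam_min)

    def apply_Z(x):
        return (A @ x - center * x) / half_width

    V = np.empty((v.shape[0], s))
    V[:, 0] = v
    if s > 1:
        V[:, 1] = apply_Z(v)
    for k in range(2, s):
        # Three-term Chebyshev recurrence: T_k = 2 Z T_{k-1} - T_{k-2}
        V[:, k] = 2.0 * apply_Z(V[:, k - 1]) - V[:, k - 2]
    return V

def monomial_basis(A, v, s):
    """Return columns v, A v, ..., A^{s-1} v."""
    V = np.empty((v.shape[0], s))
    V[:, 0] = v
    for k in range(1, s):
        V[:, k] = A @ V[:, k - 1]
    return V

# Assumed test problem: 1D Laplacian, eigenvalues inside (0, 4).
n, s = 200, 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
rng = np.random.default_rng(0)
v = rng.standard_normal(n)
v /= np.linalg.norm(v)

V_cheb = chebyshev_basis(A, v, s, lam_min=0.0, lam_max=4.0)
V_mono = monomial_basis(A, v, s)

# Small dense s-by-s Gram matrices (Euclidean inner product assumed here).
cond_cheb = np.linalg.cond(V_cheb.T @ V_cheb)
cond_mono = np.linalg.cond(V_mono.T @ V_mono)
print(f"Chebyshev Gram condition number: {cond_cheb:.3e}")
print(f"Monomial  Gram condition number: {cond_mono:.3e}")
```

In this toy setting the Chebyshev Gram matrix should be far better conditioned than the monomial one, which is the property that makes a few cheap iterative sweeps on the Gram system plausible.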

Paper Structure (22 sections, 12 theorems, 82 equations, 11 figures, 1 table, 2 algorithms)

Introduction
The $s$-step PCG method with Chebyshev Krylov basis
Iterated Gauss–Seidel Solution of the Chebyshev Gram System
Forward Gauss–Seidel iteration
Structure of Chebyshev Gram Matrices
Connection between FGS and MGS
Extension to the $A$-inner product.
Implications for stability.
Numerical Experiments
Practical Implementation
Performance Model
Interpretation of the difference.
Communication term.
Computational term.
Gram solve overhead.
...and 7 more sections

Key Result

Proposition 1

Let $\alpha^{(0)} = 0$ and let $\alpha^{(\nu)}$ be obtained after $\nu$ forward Gauss–Seidel sweeps on $W\alpha = m$ with $W = I + L + L^\top$. Then the residual satisfies [equation not reproduced]. Consequently, [equation not reproduced].
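For orientation, one residual identity follows directly from the forward Gauss-Seidel splitting $(I + L)\,\alpha^{(\nu+1)} = m - L^\top \alpha^{(\nu)}$, and it may or may not match the exact form stated in the paper. Writing $r^{(\nu)} = m - W\alpha^{(\nu)}$ (notation assumed here), one sweep gives $r^{(\nu+1)} = -L^\top (I + L)^{-1} r^{(\nu)}$, so with $\alpha^{(0)} = 0$ we get $r^{(\nu)} = \bigl(-L^\top (I + L)^{-1}\bigr)^{\nu} m$ and hence $\|r^{(\nu)}\|_2 \le \|L^\top (I + L)^{-1}\|_2^{\nu}\,\|m\|_2$. The sketch below checks this numerically on a small random system. The test matrix and the helper names `fgs_sweeps` and `predicted_residual` are hypothetical.

```python
import numpy as np

def fgs_sweeps(W, m, nu):
    """Run nu forward Gauss-Seidel sweeps on W @ alpha = m from alpha = 0."""
    lower = np.tril(W)          # I + L: diagonal plus strictly lower part
    upper = np.triu(W, k=1)     # L^T: strictly upper part
    alpha = np.zeros_like(m)
    for _ in range(nu):
        # One sweep: (I + L) alpha_new = m - L^T alpha_old
        alpha = np.linalg.solve(lower, m - upper @ alpha)
    return alpha

def predicted_residual(W, m, nu):
    """Residual predicted by r = (-L^T (I + L)^{-1})^nu m."""
    lower = np.tril(W)
    upper = np.triu(W, k=1)
    G = -upper @ np.linalg.inv(lower)
    return np.linalg.matrix_power(G, nu) @ m, np.linalg.norm(G, 2)

# Hypothetical small Gram-like system with unit diagonal: W = I + L + L^T.
rng = np.random.default_rng(0)
s = 6
L = np.tril(0.1 * rng.standard_normal((s, s)), k=-1)
W = np.eye(s) + L + L.T
m = rng.standard_normal(s)

checks = []
for nu in (1, 2, 3):
    alpha = fgs_sweeps(W, m, nu)
    r_actual = m - W @ alpha
    r_pred, g_norm = predicted_residual(W, m, nu)
    bound = g_norm**nu * np.linalg.norm(m)
    checks.append((nu, r_actual, r_pred, bound))
    print(nu, np.linalg.norm(r_actual), bound)
```

The explicit inverse is used only because the system is tiny. A production Gram solve would apply the triangular factor by substitution rather than form $(I + L)^{-1}$.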

Figures (11)

Figure 1: Empirical structure of the Chebyshev Gram matrices for the test problem considered in this section with step size $s=10$. The panels correspond to outer iterations $k=5,\ldots,8$ of the PCG-S method. A progressive concentration of the matrix entries around the diagonal can be observed as the iteration proceeds.
Figure 2: Gram-solve relative residuals $\|r_\alpha\|_2/\|m\|_2$ (for the $\alpha$ system) and $\|R_\beta\|_F/\|M\|_F$ (for the $\beta$ system) versus outer iteration, for two values of the FGS sweep count $\nu$ and several step sizes $s$, on the 27-point Poisson problem with $5.12\times10^8$ DOFs on $64$ GPUs.
Figure 3: PCG-S outer relative residual $\|r^{(k)}\|/\|b\|$ versus outer iteration, for two values of the FGS sweep count $\nu$ and step sizes $s \in \{4,6\}$, compared to classical PCG and the Cholesky-based Gram solve. Problem: 27-point Poisson, $5.12\times10^8$ DOFs on $64$ GPUs.
Figure 4: Strong-scaling behavior of $\Delta_{\mathrm{strong}}(P,s,\nu=30)$ as a function of the number of processes $P$ for several step sizes $s$, with fixed $n=500^3$. The dashed line marks $\Delta_{\mathrm{strong}}=0$.
Figure 5: Weak-scaling behavior of $\Delta_{\mathrm{weak}}(P,s,\nu=30)$ versus the number of processes $P$, for several step sizes $s$, with scaling factor $c=200^3$. The dashed line marks $\Delta_{\mathrm{weak}}=0$.
...and 6 more figures

Theorems & Definitions (25)

Remark 1
Remark 2
Proposition 1: Residual after $\nu$ FGS sweeps
proof
Corollary 1: Asymptotic residual reduction rate
Remark 3
Lemma 1: Chebyshev product formula
Theorem 1: Chebyshev Gram matrix structure
proof
Corollary 2: Diagonal entries
...and 15 more
