Scalable s-step Preconditioned Conjugate Gradient with Chebyshev Basis and Gauss-Seidel Gram Solve

Pasqua D'Ambra, Massimo Bernaschi, Mauro G. Carrozzo, Stephen Thomas

TL;DR

Large-scale experiments on modern NVIDIA GPU architectures demonstrate that the proposed Chebyshev-stabilized, Gauss-Seidel-enhanced s-step PCG achieves convergence comparable to classical CG while reducing synchronization overhead, making it a stable and scalable alternative for current and next-generation accelerator systems.

Abstract

We present a variant of the s-step Preconditioned Conjugate Gradient (PCG) method that combines a Chebyshev-stabilized Krylov basis with a Forward Gauss-Seidel (FGS) iteration for the solution of the reduced Gram systems. In s-step Conjugate Gradient, multiple search directions are generated per outer iteration, reducing global synchronization costs but requiring the solution of small dense Gram systems whose conditioning is critical for stability. We analyze the structure of the Chebyshev Gram matrix and show that its moment-based representation is associated with favorable conditioning properties for moderate step sizes. Building on inexact Krylov theory and on the classical equivalence between FGS and Modified Gram-Schmidt (MGS), we provide a structural analysis and theoretical rationale supporting the use of a small number of FGS sweeps, while preserving the convergence behavior observed in practical regimes. Large-scale experiments on modern NVIDIA GPU architectures demonstrate that the proposed Chebyshev-stabilized, Gauss-Seidel-enhanced s-step PCG achieves convergence comparable to classical CG while reducing synchronization overhead, making it a stable and scalable alternative for current and next-generation accelerator systems.

Paper Structure (22 sections, 12 theorems, 82 equations, 11 figures, 1 table, 2 algorithms)

Key Result

Proposition 1

Let $\alpha^{(0)} = 0$ and let $\alpha^{(\nu)}$ be obtained after $\nu$ forward Gauss-Seidel sweeps on $W\alpha = m$ with $W = I + L + L^\top$. Then the residual satisfies Consequently,
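
As a sketch under the definitions stated in the proposition (and not necessarily in the form the paper uses), one forward sweep solves $(I+L)\,\alpha^{(\nu+1)} = m - L^\top\alpha^{(\nu)}$. Writing $r^{(\nu)} = m - W\alpha^{(\nu)}$, a direct calculation gives $r^{(\nu+1)} = -L^\top (I+L)^{-1} r^{(\nu)}$, so with $\alpha^{(0)} = 0$ one obtains $r^{(\nu)} = \bigl(-L^\top (I+L)^{-1}\bigr)^{\nu} m$. The Python snippet below checks this identity numerically. The test matrix, its size, and the helper names `fgs_sweeps` and `make_unit_diagonal_spd` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def fgs_sweeps(W, m, nu):
    """Run nu forward Gauss-Seidel sweeps on W @ alpha = m, starting from alpha = 0."""
    s = W.shape[0]
    alpha = np.zeros(s)
    for _ in range(nu):
        for i in range(s):
            # Entries j < i are already updated in this sweep; entries j > i are old.
            sigma = W[i, :i] @ alpha[:i] + W[i, i + 1:] @ alpha[i + 1:]
            alpha[i] = (m[i] - sigma) / W[i, i]
    return alpha


def make_unit_diagonal_spd(s, off_scale=0.05, seed=0):
    """Hypothetical test matrix W = I + L + L^T with small strictly lower part L.

    With |L_ij| <= off_scale and s * off_scale < 1, W is strictly diagonally
    dominant with unit diagonal, hence symmetric positive definite.
    """
    rng = np.random.default_rng(seed)
    L = np.tril(rng.uniform(-off_scale, off_scale, size=(s, s)), k=-1)
    W = np.eye(s) + L + L.T
    return W, L


s, nu = 6, 3
W, L = make_unit_diagonal_spd(s)
m = np.ones(s)

# Residual computed directly from the iterate after nu sweeps.
alpha = fgs_sweeps(W, m, nu)
r_direct = m - W @ alpha

# Residual predicted by the closed form (-L^T (I + L)^{-1})^nu m.
T = -L.T @ np.linalg.inv(np.eye(s) + L)
r_formula = np.linalg.matrix_power(T, nu) @ m

print("identity holds:", np.allclose(r_direct, r_formula))
print("relative residual:", np.linalg.norm(r_direct) / np.linalg.norm(m))
```

The explicit inverse is used only to keep the check short for a small dense system; a triangular solve would be the natural choice in practice.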

Figures (11)

  • Figure 1: Empirical structure of the Chebyshev Gram matrices for the test problem considered in this section with step size $s=10$. The panels correspond to outer iterations $k=5,\ldots,8$ of the PCG-S method. A progressive concentration of the matrix entries around the diagonal can be observed as the iteration proceeds.
  • Figure 2: Gram-solve relative residuals $\|r_\alpha\|_2/\|m\|_2$ (for the $\alpha$ system) and $\|R_\beta\|_F/\|M\|_F$ (for the $\beta$ system) versus outer iteration, for two values of the FGS sweep count $\nu$ and several step sizes $s$, on the 27-point Poisson problem with $5.12\times10^8$ DOFs on $64$ GPUs.
  • Figure 3: PCG-S outer relative residual $\|r^{(k)}\|/\|b\|$ versus outer iteration, for two values of the FGS sweep count $\nu$ and step sizes $s \in \{4,6\}$, compared to classical PCG and the Cholesky-based Gram solve. Problem: 27-point Poisson, $5.12\times10^8$ DOFs on $64$ GPUs.
  • Figure 4: Strong-scaling behavior of $\Delta_{\mathrm{strong}}(P,s,\nu=30)$ as a function of the number of processes $P$ for several step sizes $s$, with fixed $n=500^3$. The dashed line marks $\Delta_{\mathrm{strong}}=0$.
  • Figure 5: Weak-scaling behavior of $\Delta_{\mathrm{weak}}(P,s,\nu=30)$ versus the number of processes $P$, for several step sizes $s$, with scaling factor $c=200^3$. The dashed line marks $\Delta_{\mathrm{weak}}=0$.
  • ...and 6 more figures

Theorems & Definitions (25)

  • Remark 1
  • Remark 2
  • Proposition 1: Residual after $\nu$ FGS sweeps
  • proof
  • Corollary 1: Asymptotic residual reduction rate
  • Remark 3
  • Lemma 1: Chebyshev product formula
  • Theorem 1: Chebyshev Gram matrix structure
  • proof
  • Corollary 2: Diagonal entries
  • ...and 15 more
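
The statements of Lemma 1 and Theorem 1 are not reproduced in this listing. As a hedged illustration of what a moment-based representation of a Chebyshev Gram matrix can look like, the sketch below uses the standard product identity $T_i T_j = \tfrac12\,(T_{i+j} + T_{|i-j|})$. It assumes a symmetric matrix $A$ whose spectrum already lies in $[-1,1]$, a plain Euclidean inner product, and no preconditioner. These are simplifying assumptions, and the paper's actual Gram matrices may be defined differently. Under them, $G_{ij} = \langle T_i(A)r, T_j(A)r\rangle = \tfrac12\,(\mu_{i+j} + \mu_{|i-j|})$ with moments $\mu_k = r^\top T_k(A)\, r$. The function names are hypothetical.

```python
import numpy as np


def chebyshev_basis(A, r, s):
    """Columns T_0(A) r, ..., T_{s-1}(A) r from the three-term recurrence."""
    V = np.zeros((r.shape[0], s))
    V[:, 0] = r
    if s > 1:
        V[:, 1] = A @ r
    for k in range(2, s):
        # T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x)
        V[:, k] = 2.0 * (A @ V[:, k - 1]) - V[:, k - 2]
    return V


def gram_from_moments(mu, s):
    """Assemble G_ij = (mu_{i+j} + mu_{|i-j|}) / 2 from moments mu_0..mu_{2s-2}."""
    idx = np.arange(s)
    i, j = np.meshgrid(idx, idx, indexing="ij")
    return 0.5 * (mu[i + j] + mu[np.abs(i - j)])


# Illustrative symmetric test matrix, scaled so its spectrum lies in [-1, 1].
rng = np.random.default_rng(0)
n, s = 50, 5
B = rng.standard_normal((n, n))
A = 0.5 * (B + B.T)
A /= np.linalg.norm(A, 2)
r = rng.standard_normal(n)

# Moments mu_k = r^T T_k(A) r for k = 0, ..., 2s-2.
V_long = chebyshev_basis(A, r, 2 * s - 1)
mu = V_long.T @ r

# Gram matrix of the first s basis vectors, formed directly and from moments.
V = V_long[:, :s]
G_direct = V.T @ V
G_moment = gram_from_moments(mu, s)

print("moment form matches:", np.allclose(G_direct, G_moment))
```

In this simplified setting the $s \times s$ Gram matrix is determined by $2s-1$ scalars, which is one way to read the phrase "moment-based representation" in the abstract.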