Table of Contents
Fetching ...

A numerically stable communication-avoiding s-step GMRES algorithm

Zan Xu, Juan J. Alonso, Eric Darve

TL;DR

This work tackles the communication bottleneck of Krylov solvers on modern HPC by introducing a numerically stable, adaptive $s$-step GMRES that dynamically selects the step size $s$ using a partial CholQR framework. It combines a stability-aware polynomial basis, notably scaled Newton polynomials, with an initial-step estimator to maximize communication savings while maintaining numerical stability, and demonstrates near-linear parallel scalability on up to 114{,}000 cores. The authors analyze how polynomial bases and preconditioning affect stability, and provide extensive numerical experiments across diagonal, Laplace, and real-world matrices, including SuiteSparse problems, to validate stability and performance. The practical impact is a robust, scalable GMRES variant that significantly reduces communication costs without sacrificing accuracy or convergence on large-scale HPC systems.

Abstract

Krylov subspace methods are extensively used in scientific computing to solve large-scale linear systems. However, the performance of these iterative Krylov solvers on modern supercomputers is limited by expensive communication costs. The $s$-step strategy generates a series of $s$ Krylov vectors at a time to avoid communication. Asymptotically, the $s$-step approach can reduce communication latency by a factor of $s$. Unfortunately, due to finite-precision implementation, the step size has to be kept small for stability. In this work, we tackle the numerical instabilities encountered in the $s$-step GMRES algorithm. By choosing an appropriate polynomial basis and block orthogonalization schemes, we construct a communication avoiding $s$-step GMRES algorithm that automatically selects the optimal step size to ensure numerical stability. To further maximize communication savings, we introduce scaled Newton polynomials that can increase the step size $s$ to a few hundreds for many problems. An initial step size estimator is also developed to efficiently choose the optimal step size for stability. The guaranteed stability of the proposed algorithm is demonstrated using numerical experiments. In the process, we also evaluate how the choice of polynomial and preconditioning affects the stability limit of the algorithm. Finally, we show parallel scalability on more than 114,000 cores in a distributed-memory setting. Perfectly linear scaling has been observed in both strong and weak scaling studies with negligible communication costs.

A numerically stable communication-avoiding s-step GMRES algorithm

TL;DR

This work tackles the communication bottleneck of Krylov solvers on modern HPC by introducing a numerically stable, adaptive -step GMRES that dynamically selects the step size using a partial CholQR framework. It combines a stability-aware polynomial basis, notably scaled Newton polynomials, with an initial-step estimator to maximize communication savings while maintaining numerical stability, and demonstrates near-linear parallel scalability on up to 114{,}000 cores. The authors analyze how polynomial bases and preconditioning affect stability, and provide extensive numerical experiments across diagonal, Laplace, and real-world matrices, including SuiteSparse problems, to validate stability and performance. The practical impact is a robust, scalable GMRES variant that significantly reduces communication costs without sacrificing accuracy or convergence on large-scale HPC systems.

Abstract

Krylov subspace methods are extensively used in scientific computing to solve large-scale linear systems. However, the performance of these iterative Krylov solvers on modern supercomputers is limited by expensive communication costs. The -step strategy generates a series of Krylov vectors at a time to avoid communication. Asymptotically, the -step approach can reduce communication latency by a factor of . Unfortunately, due to finite-precision implementation, the step size has to be kept small for stability. In this work, we tackle the numerical instabilities encountered in the -step GMRES algorithm. By choosing an appropriate polynomial basis and block orthogonalization schemes, we construct a communication avoiding -step GMRES algorithm that automatically selects the optimal step size to ensure numerical stability. To further maximize communication savings, we introduce scaled Newton polynomials that can increase the step size to a few hundreds for many problems. An initial step size estimator is also developed to efficiently choose the optimal step size for stability. The guaranteed stability of the proposed algorithm is demonstrated using numerical experiments. In the process, we also evaluate how the choice of polynomial and preconditioning affects the stability limit of the algorithm. Finally, we show parallel scalability on more than 114,000 cores in a distributed-memory setting. Perfectly linear scaling has been observed in both strong and weak scaling studies with negligible communication costs.
Paper Structure (31 sections, 30 equations, 15 figures, 1 table, 4 algorithms)

This paper contains 31 sections, 30 equations, 15 figures, 1 table, 4 algorithms.

Figures (15)

  • Figure 1: Evolution of Krylov vectors $V_j$ (top) and their corresponding coefficients $C_{i,j}$ in the eigenspace of $A$(bottom) under the action of Scaled Newton polynomials. The two representations are related via $V_j = UC_j$. The entries highlighted in red are $O(\varepsilon)$ if Ritz values are well-approximated.
  • Figure 1: Diagonal matrix of size $N=10^4$ with evenly distributed eigenvalues in (0.1,10). The top sub-figure shows the relative residual and the LOO errors of GMRES and the adaptive $s$-step GMRES with a monomial basis. The relative residual of the adaptive $s$-step GMRES algorithm agrees with the baseline. The LOO error of the adaptive s-step GMRES algorithm is maintained near the machine epsilon. The lower left sub-figure shows the adapted block size, $s$. In this case, the adapted step size is constant at $s = 6$. The lower right sub-figure shows the incremental condition number of the triangular matrix seen by the first instance of partial CholQR in the first block iteration. The ICE estimate is very close to the true value given by the SVD.
  • Figure 1: Timing breakdown for weak scaling test of 3D Laplace matrix. Every four columns from left to right are: MGS-GMRES (MGS), CGS2-GMRES (CGS2), BCGS2-TSQR-GMRES (TSQR), adaptive s-step GMRES (Adapt.). The timing of SpMV/MPK is inverted and plotted below the x-axis in the bar plot to visually showcase the scalability of different operations without stacking all of them. The orthogonalization steps include both computational costs (Comp.) and communication costs (Comm.). Other miscellaneous operations such as updating the upper Hessenberg matrix and checking for convergence are included as well though their timings are almost negligible.
  • Figure 2: 2D Laplace matrix of size $N=1.6\times10^{\color{red}{}5}$ using 5-stencil finite difference discretization. The top sub-figure shows the relative residual and LOO errors of GMRES and adaptive s-step GMRES with a monomial basis. Good agreement of the relative residuals between the baseline and the adaptive algorithm can be observed. LOO of the adaptive s-step GMRES is kept near the machine epsilon. The lower left sub-figure shows the adapted block size, $s$. In this case, the adapted step size is constant at $s=6$. The lower right sub-figure shows the incremental condition number of the triangular matrix seen by the first instance of partial CholQR in the first block iteration. The ICE estimate agrees well with the true SVD estimate.
  • Figure 2: Timing breakdown for weak scaling test of preconditioned 3D Laplace matrix using the adaptive s-step GMRES algorithm. The timings of SpMVs and preconditioning steps (Prec.) are inverted in the bar plot to showcase scalability of different operations. The orthogonalization steps include both computational costs (Comp.) and communication costs (Comm.). Other miscellaneous operations such as updating the upper Hessenberg matrix and checking for convergence are included as well though their timings are almost negligible.
  • ...and 10 more figures