A numerically stable communication-avoiding s-step GMRES algorithm
Zan Xu, Juan J. Alonso, Eric Darve
TL;DR
This work tackles the communication bottleneck of Krylov solvers on modern HPC by introducing a numerically stable, adaptive $s$-step GMRES that dynamically selects the step size $s$ using a partial CholQR framework. It combines a stability-aware polynomial basis, notably scaled Newton polynomials, with an initial-step estimator to maximize communication savings while maintaining numerical stability, and demonstrates near-linear parallel scalability on up to 114{,}000 cores. The authors analyze how polynomial bases and preconditioning affect stability, and provide extensive numerical experiments across diagonal, Laplace, and real-world matrices, including SuiteSparse problems, to validate stability and performance. The practical impact is a robust, scalable GMRES variant that significantly reduces communication costs without sacrificing accuracy or convergence on large-scale HPC systems.
Abstract
Krylov subspace methods are extensively used in scientific computing to solve large-scale linear systems. However, the performance of these iterative Krylov solvers on modern supercomputers is limited by expensive communication costs. The $s$-step strategy generates a series of $s$ Krylov vectors at a time to avoid communication. Asymptotically, the $s$-step approach can reduce communication latency by a factor of $s$. Unfortunately, due to finite-precision implementation, the step size has to be kept small for stability. In this work, we tackle the numerical instabilities encountered in the $s$-step GMRES algorithm. By choosing an appropriate polynomial basis and block orthogonalization schemes, we construct a communication avoiding $s$-step GMRES algorithm that automatically selects the optimal step size to ensure numerical stability. To further maximize communication savings, we introduce scaled Newton polynomials that can increase the step size $s$ to a few hundreds for many problems. An initial step size estimator is also developed to efficiently choose the optimal step size for stability. The guaranteed stability of the proposed algorithm is demonstrated using numerical experiments. In the process, we also evaluate how the choice of polynomial and preconditioning affects the stability limit of the algorithm. Finally, we show parallel scalability on more than 114,000 cores in a distributed-memory setting. Perfectly linear scaling has been observed in both strong and weak scaling studies with negligible communication costs.
