Two-Stage Block Orthogonalization to Improve Performance of $s$-step GMRES

Ichitaro Yamazaki; Andrew J. Higgins; Erik G. Boman; Daniel B. Szyld

TL;DR

This work addresses the high communication cost of orthogonalization in GMRES on modern architectures by enhancing the $s$-step GMRES framework with a two-stage block orthogonalization scheme. The method pre-processes a block of $s$ basis vectors to keep conditioning under control and defers the heavier orthogonalization to a larger block of size $\widehat{s}$, reducing global synchronization and improving data reuse. The authors analyze stability and demonstrate substantial performance gains (up to $2.6\times$ orthogonalization speedup and $1.6\times$ total speedup on Summit for 2D Laplace problems), with similar benefits observed on 3D problems and SuiteSparse matrices. Implemented in Trilinos and tested on GPU-accelerated clusters, the approach reduces synchronization requirements and can complement other stability techniques, offering a practical path to faster Krylov solvers on exascale systems.

Abstract

On current computer architectures, the performance of GMRES can be limited by the communication cost of generating orthonormal basis vectors of the Krylov subspace. To address this bottleneck, its $s$-step variant orthogonalizes a block of $s$ basis vectors at a time, potentially reducing the communication cost by a factor of $s$. Unfortunately, for a large step size $s$, the solver can generate extremely ill-conditioned basis vectors, so in practice a conservatively small step size is used to maintain stability, which limits the performance of the $s$-step solver. To enhance the performance with a small step size, this paper introduces a two-stage block orthogonalization scheme. As in the original scheme, the first stage operates on a block of $s$ basis vectors at a time, but its objective is only to keep the generated basis vectors well-conditioned, at a lower cost. The orthogonalization of the basis vectors is delayed until the second stage, when enough basis vectors have been generated to obtain higher performance. Our analysis shows the stability of the proposed two-stage scheme. Performance improves because, although the scheme requires the same amount of computation as the original, most of the communication is done in the second stage, reducing the overall communication requirements. Our performance results with up to 192 NVIDIA V100 GPUs on the Summit supercomputer demonstrate that, when solving a 2D Laplace problem, the two-stage approach can reduce the orthogonalization time and the total time-to-solution by factors of up to $2.6\times$ and $1.6\times$, respectively, over the original $s$-step GMRES, which had already obtained respective speedups of $2.1\times$ and $1.8\times$ over standard GMRES. Similar speedups were obtained for 3D problems and for matrices from the SuiteSparse Matrix Collection.
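The two-stage idea described in the abstract can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' pseudocode or their Trilinos implementation: it assumes both stages use one pass of block Gram-Schmidt with the Pythagorean inner product (BCGS-PIP, named in the paper's section list), the function names `bcgs_pip` and `two_stage_orth` are hypothetical, and the panels are random matrices rather than Krylov vectors produced by a matrix-powers kernel.

```python
import numpy as np

def bcgs_pip(Q, V):
    """One pass of block Gram-Schmidt with the Pythagorean inner product.

    Q is n-by-k with (nearly) orthonormal columns (k may be 0), V is n-by-b.
    Returns (Q_new, P, R) with V ~= Q @ P + Q_new @ R.
    """
    k = Q.shape[1]
    # Single fused Gram product; in a distributed setting this would be
    # the one global reduction of the pass.
    W = np.hstack([Q, V]).T @ V
    P, G = W[:k], W[k:]
    # Pythagorean identity: (V - Q P)^T (V - Q P) = G - P^T P when Q^T Q = I.
    R = np.linalg.cholesky(G - P.T @ P).T      # upper-triangular factor
    Q_new = np.linalg.solve(R.T, (V - Q @ P).T).T
    return Q_new, P, R

def two_stage_orth(Q0, panels):
    """Hypothetical two-stage driver for one big panel.

    Stage 1: pre-process each small panel with one BCGS-PIP pass against
             Q0 and the already pre-processed panels (keeps conditioning
             under control).
    Stage 2: one BCGS-PIP pass on the whole big panel against Q0.
    """
    pre = []
    for V in panels:
        basis = np.hstack([Q0] + pre)
        V_hat, _, _ = bcgs_pip(basis, V)
        pre.append(V_hat)
    big_panel = np.hstack(pre)
    Q_big, _, _ = bcgs_pip(Q0, big_panel)
    return Q_big

# Illustrative data only: random panels stand in for s-step Krylov blocks.
rng = np.random.default_rng(0)
n, s, n_panels = 200, 4, 3
Q0, _ = np.linalg.qr(rng.standard_normal((n, 2)))
panels = [rng.standard_normal((n, s)) for _ in range(n_panels)]

Q_big = two_stage_orth(Q0, panels)
B = np.hstack([Q0, Q_big])
orth_err = np.linalg.norm(B.T @ B - np.eye(B.shape[1]))
```

In this sketch each call to `bcgs_pip` performs a single Gram product, so the first stage costs one reduction per small panel and the second stage one reduction per big panel. The coefficient blocks `P` and `R` are discarded here for brevity; a GMRES solver would need to accumulate them to form the Hessenberg matrix.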

Paper Structure (12 sections, 3 theorems, 16 equations, 13 figures, 4 tables)

Introduction
Related Work
$s$-step GMRES
Block Orthogonalization
BCGS2 with HHQR
BCGS2 with CholQR2
BCGS-PIP2
Two-stage Block Orthogonalization
Numerical Results
Implementation
Performance Results
Conclusion

Key Result

Theorem 4.1

With the bound eq:cholqr_bound and assumption eq:assumption-1, the condition number of $\widetilde{V}_j$ computed by the first CholQR (on Line 2 in Fig. 3) is bounded by and hence, the orthogonality error of $\widehat{Q}_j$ computed by CholQR2 satisfies

Figures (13)

Figure 1: Pseudocode of $s$-step GMRES where $[Q_j,R_j] = \hbox{BlkOrth}(Q,V_j)$ returns the QR factorization such that $Q R = V$ with $Q^TQ = I$ and $R$ is upper triangular with non-negative diagonals.
Figure 2: Block Classical Gram-Schmidt to generate a new set of orthonormal basis vectors $Q_j$. HHQR$(\widehat{V}_j)$ returns the QR factorization of $\widehat{V}_j$ based on the Householder algorithm, while the pseudocode of CholQR2 is shown in Fig. 3.
Figure 3: Intra-block Cholesky QR to orthonormalize a set of vectors $\widehat{V} \in \mathbb{R}^{n\times (s+1)}$, where $\hbox{Chol}(G)$ returns the upper-triangular Cholesky factor of the Gram matrix $G$.
Figure 4: BCGS with Pythagorean Inner Product to generate a new set of orthonormal basis vectors $Q_j$.
Figure 5: Pseudocode of two-stage algorithms to generate the orthonormal basis vectors of "Big Panel" consisting of $\widehat{s}+1$ Krylov vectors.
...and 8 more figures
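
The Cholesky QR step named in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration of the standard CholQR and CholQR2 algorithms, assuming dense in-memory matrices; the function names `cholqr` and `cholqr2` and the test matrix are hypothetical and are not taken from the paper's pseudocode.

```python
import numpy as np

def cholqr(V):
    """Cholesky QR: V = Q R, with R the upper-triangular Cholesky factor
    of the Gram matrix G = V^T V."""
    G = V.T @ V                          # one global reduction when distributed
    R = np.linalg.cholesky(G).T          # NumPy returns the lower factor; transpose it
    Q = np.linalg.solve(R.T, V.T).T      # Q = V R^{-1}
    return Q, R

def cholqr2(V):
    """CholQR applied twice; the second pass repairs the loss of
    orthogonality left by the first."""
    Q1, R1 = cholqr(V)
    Q, R2 = cholqr(Q1)
    return Q, R2 @ R1

# Illustrative test matrix with condition number 1e4 (an assumed value,
# chosen small enough that the Cholesky factorization of G succeeds).
rng = np.random.default_rng(1)
n, b = 300, 5
U, _ = np.linalg.qr(rng.standard_normal((n, b)))
W, _ = np.linalg.qr(rng.standard_normal((b, b)))
V = U @ np.diag(np.logspace(0, -4, b)) @ W.T

Q, R = cholqr2(V)
orth_err = np.linalg.norm(Q.T @ Q - np.eye(b))
resid = np.linalg.norm(Q @ R - V) / np.linalg.norm(V)
```

Because the Gram matrix squares the condition number of the input, a single `cholqr` pass can fail or lose orthogonality on ill-conditioned blocks, which is why the sketch keeps the test matrix moderately conditioned.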

Theorems & Definitions (5)

Theorem 4.1
proof
Theorem 4.2
proof
Theorem 5.1
