On the loss of orthogonality in low-synchronization variants of reorthogonalized block classical Gram-Schmidt

Erin Carson; Kathryn Lund; Yuxin Ma; Eda Oktay

TL;DR

This paper develops an abstract framework to analyze the numerical stability of low-synchronization block Gram-Schmidt algorithms and shows that a strong intrablock orthogonalization is only necessary for the first block to maintain orthogonality at unit roundoff. It derives stability bounds for both non-reorthogonalized and reorthogonalized variants, demonstrating that reducing synchronization points degrades stability, and identifies a viable one-sync reorthogonalized method for the column case. The results reveal that DCGS2 and CGS-2 are as stable as Householder QR in the single-column setting, while block methods face stringent condition-number restrictions as sync-points are removed. Numerical experiments in BlockStab illustrate the trade-offs between intraorthogonalization subroutines and synchronization efficiency. The findings guide the design of communication-avoiding orthogonalization schemes suitable for exascale computing, highlighting both possibilities and limitations of low-sync variants.

Abstract

Interest in communication-avoiding orthogonalization schemes for high-performance computing has been growing recently. This manuscript addresses open questions about the numerical stability of various block classical Gram-Schmidt variants that have been proposed in the past few years. An abstract framework is employed, the flexibility of which allows for new rigorous bounds on the loss of orthogonality in these variants. We first analyze a generalization of (reorthogonalized) block classical Gram-Schmidt and show that a "strong" intrablock orthogonalization routine is only needed for the very first block in order to maintain orthogonality on the level of the unit roundoff. In particular, this "strong" first step does not have to be a reorthogonalized QR itself, and subsequent steps can use less stable QR variants, thus keeping the overall communication costs low. Then, using this variant, which has four synchronization points per block column, we remove the synchronization points one at a time and analyze how each alteration affects the stability of the resulting method. Our analysis shows that the variant requiring only one synchronization per block column cannot be guaranteed to be stable in practice, as stability begins to degrade with the first reduction of synchronization points. Our analysis of block methods also provides new theoretical results for the single-column case. In particular, it is proven that DCGS2 from [Bielich, D. et al., Par. Comput. 112 (2022)] and CGS-2 from [Świrydowicz, K. et al., Num. Lin. Alg. Appl. 28 (2021)] are as stable as Householder QR. Numerical examples from the BlockStab toolbox are included throughout to help compare variants and illustrate the effects of different choices of intraorthogonalization subroutines.
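To make the quantities in the abstract concrete, the following is a minimal NumPy sketch of a textbook block classical Gram-Schmidt with one inner reorthogonalization pass per block. It is not the paper's algorithm listing and does not use the BlockStab toolbox: the function names `bcgs_reorth` and `loss_of_orthogonality`, the use of `np.linalg.qr` as the intrablock orthogonalization, and the test matrix with condition number around $10^6$ are all assumptions made for illustration. Each block after the first performs two projections and two intrablock QR factorizations, which in a distributed setting would plausibly correspond to the four synchronization points mentioned above.

```python
import numpy as np

def loss_of_orthogonality(Q):
    """Return ||I - Q^T Q||_2, the usual measure of departure from orthonormality."""
    return np.linalg.norm(np.eye(Q.shape[1]) - Q.T @ Q, 2)

def bcgs_reorth(X, s, io=np.linalg.qr):
    """Block classical Gram-Schmidt with one reorthogonalization pass per block.

    X  : m x n matrix, with n a multiple of the block size s
    io : intrablock orthogonalization returning (Q, R); Householder-based
         np.linalg.qr is an illustrative choice, not prescribed by the source
    """
    m, n = X.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(0, n, s):
        cols = slice(k, k + s)
        W = X[:, cols]
        if k == 0:
            # First block: only an intrablock orthogonalization is needed.
            Q[:, cols], R[cols, cols] = io(W)
            continue
        Qp = Q[:, :k]
        # First pass: project out the previous basis, then orthogonalize.
        S1 = Qp.T @ W
        U, T1 = io(W - Qp @ S1)
        # Second pass: repeat on the intermediate block U.
        S2 = Qp.T @ U
        Qk, T2 = io(U - Qp @ S2)
        Q[:, cols] = Qk
        # Combine both passes: X_k = Qp (S1 + S2 T1) + Qk (T2 T1).
        R[:k, cols] = S1 + S2 @ T1
        R[cols, cols] = T2 @ T1
    return Q, R

# Hypothetical test matrix with prescribed singular values (kappa ~ 1e6).
rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.standard_normal((200, 20)))
V0, _ = np.linalg.qr(rng.standard_normal((20, 20)))
X = U0 @ np.diag(np.logspace(0, -6, 20)) @ V0.T

Q, R = bcgs_reorth(X, s=4)
loss = loss_of_orthogonality(Q)
residual = np.linalg.norm(X - Q @ R) / np.linalg.norm(X)
print("loss of orthogonality:", loss)
print("relative residual:", residual)
```

Passing a different callable as `io` (any function returning a `(Q, R)` pair) is one way to experiment with the effect of weaker intrablock routines on the measured loss of orthogonality.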

Paper Structure (12 sections, 18 theorems, 176 equations, 11 figures, 3 tables, 5 algorithms)

Introduction
Improved stability of BCGS with inner reorthogonalization
An abstract framework for block Gram-Schmidt
Loss of orthogonality of BCGS-A
Loss of orthogonality of BCGSI+A
Derivation of a one-sync, reorthogonalized, block Gram-Schmidt method
Loss of orthogonality of low-sync versions of BCGSI+A
BCGSI+A-3S
BCGSI+A-2S
BCGSI+A-1S
Summary and consequences of bounds
Conclusions

Key Result

Lemma 1

Assume that $\bar{\bm{G}}$, $\tilde{\bm{G}}$, $\bar{\bm{Q}}$, $\bar{R}$, and $\bar{\bm{\mathcal{Q}}}_{prev}$ satisfy eq:epsproj, eq:epsqr, and eq:epsQkp, and that is satisfied. Furthermore, assume that $\bar{R}$ is nonsingular. Then and $\bar{\bm{\mathcal{Q}}}_{new} = [\bar{\bm{\mathcal{Q}}}_{prev}, \bar{\bm{Q}}]$ satisfies

Figures (11)

Figure 1: Comparison among \ref{alg:BCGSA}, \ref{alg:BCGSA}, and \ref{alg:BCGSIROA} on a class of monomial matrices from BlockStab.
Figure 2: Comparison between \ref{alg:BCGSIROA} (i.e., Algorithm \ref{alg:BCGSIROA} with all IOs equal) and \ref{alg:BCGSIROA} ($\texttt{IO}_{\mathrm{A}} = \texttt{HouseQR}$ and $\texttt{IO}_1 = \texttt{IO}_2$) on a class of monomial matrices.
Figure 3: Comparison between \ref{alg:BCGSIROA} (i.e., Algorithm \ref{alg:BCGSIROA} with all IOs equal) and \ref{alg:BCGSIROA} ($\texttt{IO}_{\mathrm{A}} = \texttt{HouseQR}$ and $\texttt{IO}_1 = \texttt{IO}_2$) on a class of piled matrices.
Figure 4: Comparison between \ref{alg:BCGSIROA} and \ref{alg:BCGSIROA3S} on a class of monomial matrices. Note that $\texttt{IO}_{\mathrm{A}}$ is fixed as HouseQR, and $\texttt{IO} = \texttt{IO}_1 = \texttt{IO}_2$.
Figure 5: Comparison among low-sync versions of \ref{alg:BCGSIROA} on a class of monomial matrices. Note that $\texttt{IO}_{\mathrm{A}}$ is fixed as HouseQR.
...and 6 more figures

Theorems & Definitions (33)

Lemma 1
Lemma 2
proof
proof: Proof of Lemma \ref{lem:epsQk-relation}
Lemma 3
proof
Lemma 4
proof
Lemma 5
Theorem 1
...and 23 more
