On the loss of orthogonality in low-synchronization variants of reorthogonalized block classical Gram-Schmidt
Erin Carson, Kathryn Lund, Yuxin Ma, Eda Oktay
TL;DR
This paper develops an abstract framework to analyze the numerical stability of low-synchronization block Gram-Schmidt algorithms and shows that a strong intrablock orthogonalization is only necessary for the first block to maintain orthogonality at the level of the unit roundoff. It derives stability bounds for both non-reorthogonalized and reorthogonalized variants, demonstrating that reducing synchronization points degrades stability, and identifies a viable one-sync reorthogonalized method for the single-column case. The results reveal that DCGS2 and CGS-2 are as stable as Householder QR in the single-column setting, while block methods face stringent condition-number restrictions as synchronization points are removed. Numerical experiments with the BlockStab toolbox illustrate the trade-offs between intraorthogonalization subroutines and synchronization efficiency. The findings guide the design of communication-avoiding orthogonalization schemes suitable for exascale computing, highlighting both possibilities and limitations of low-sync variants.
Abstract
Interest in communication-avoiding orthogonalization schemes for high-performance computing has been growing recently. This manuscript addresses open questions about the numerical stability of various block classical Gram-Schmidt variants that have been proposed in the past few years. An abstract framework is employed, the flexibility of which allows for new rigorous bounds on the loss of orthogonality in these variants. We first analyze a generalization of (reorthogonalized) block classical Gram-Schmidt and show that a "strong" intrablock orthogonalization routine is only needed for the very first block in order to maintain orthogonality on the level of the unit roundoff. In particular, this "strong" first step does not have to be a reorthogonalized QR itself and subsequent steps can use less stable QR variants, thus keeping the overall communication costs low. Then, using this variant, which has four synchronization points per block column, we remove the synchronization points one at a time and analyze how each alteration affects the stability of the resulting method. Our analysis shows that the variant requiring only one synchronization per block column cannot be guaranteed to be stable in practice, as stability begins to degrade with the first reduction of synchronization points. Our analysis of block methods also provides new theoretical results for the single-column case. In particular, it is proven that DCGS2 from [Bielich, D. et al., Par. Comput. 112 (2022)] and CGS-2 from [Świrydowicz, K. et al., Num. Lin. Alg. Appl. 28 (2021)] are as stable as Householder QR. Numerical examples from the BlockStab toolbox are included throughout, to help compare variants and illustrate the effects of different choices of intraorthogonalization subroutines.
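
To make the quantity under study concrete, the following is a minimal sketch of the single-column setting. It is not taken from the manuscript or from the BlockStab toolbox: it uses textbook classical Gram-Schmidt with an optional second projection pass, not DCGS2 or any of the low-synchronization variants analyzed here. The function names (dot, cgs, loss_of_orthogonality, lauchli_columns) and the Läuchli-type test matrix are assumptions chosen for illustration. The loss of orthogonality is measured as the Frobenius norm of I - Q^T Q.

```python
import math


def dot(u, v):
    """Inner product of two vectors stored as Python lists.

    On a distributed machine each batch of such inner products would be
    a global reduction, i.e. a synchronization point.
    """
    return sum(a * b for a, b in zip(u, v))


def cgs(columns, passes=1):
    """Classical Gram-Schmidt on a list of column vectors.

    passes=1 is plain CGS; passes=2 repeats the projection step once
    (textbook reorthogonalization). Illustrative only: this is not the
    DCGS2 or low-sync formulation discussed in the abstract.
    """
    q = []
    for x in columns:
        w = list(x)
        for _ in range(passes):
            # Classical variant: all coefficients come from the same w,
            # computed before any update is applied.
            coeffs = [dot(qj, w) for qj in q]
            for c, qj in zip(coeffs, q):
                w = [wi - c * qi for wi, qi in zip(w, qj)]
        norm = math.sqrt(dot(w, w))
        q.append([wi / norm for wi in w])
    return q


def loss_of_orthogonality(q):
    """Frobenius norm of I - Q^T Q for the computed basis q."""
    total = 0.0
    for i, qi in enumerate(q):
        for j, qj in enumerate(q):
            entry = (1.0 if i == j else 0.0) - dot(qi, qj)
            total += entry * entry
    return math.sqrt(total)


def lauchli_columns(eps):
    """Three columns of a Lauchli-type matrix (hypothetical test case)."""
    return [
        [1.0, eps, 0.0, 0.0],
        [1.0, 0.0, eps, 0.0],
        [1.0, 0.0, 0.0, eps],
    ]


columns = lauchli_columns(1e-8)
loss_one_pass = loss_of_orthogonality(cgs(columns, passes=1))
loss_two_pass = loss_of_orthogonality(cgs(columns, passes=2))
print("one pass :", loss_one_pass)
print("two pass :", loss_two_pass)
```

On this ill-conditioned example, a single pass should lose orthogonality almost completely, while the second pass should bring the measure back to roughly the level of the unit roundoff in double precision. The extra pass also adds another batch of inner products per column, which is the communication cost that low-synchronization variants try to reduce.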
