On the loss of orthogonality in low-synchronization variants of reorthogonalized block classical Gram-Schmidt

Erin Carson, Kathryn Lund, Yuxin Ma, Eda Oktay

TL;DR

This paper develops an abstract framework to analyze the numerical stability of low-synchronization block Gram-Schmidt algorithms and shows that a strong intrablock orthogonalization is only necessary for the first block to maintain orthogonality at unit roundoff. It derives stability bounds for both non-reorthogonalized and reorthogonalized variants, demonstrating that reducing synchronization points degrades stability, and identifies a viable one-sync reorthogonalized method for the column case. The results reveal that DCGS2 and CGS-2 are as stable as Householder QR in the single-column setting, while block methods face stringent condition-number restrictions as sync-points are removed. Numerical experiments in BlockStab illustrate the trade-offs between intraorthogonalization subroutines and synchronization efficiency. The findings guide the design of communication-avoiding orthogonalization schemes suitable for exascale computing, highlighting both possibilities and limitations of low-sync variants.

Abstract

Interest in communication-avoiding orthogonalization schemes for high-performance computing has been growing recently. This manuscript addresses open questions about the numerical stability of various block classical Gram-Schmidt variants that have been proposed in the past few years. An abstract framework is employed, the flexibility of which allows for new rigorous bounds on the loss of orthogonality in these variants. We first analyze a generalization of (reorthogonalized) block classical Gram-Schmidt and show that a "strong" intrablock orthogonalization routine is only needed for the very first block in order to maintain orthogonality on the level of the unit roundoff. In particular, this "strong" first step does not have to be a reorthogonalized QR itself and subsequent steps can use less stable QR variants, thus keeping the overall communication costs low. Then, using this variant, which has four synchronization points per block column, we remove the synchronization points one at a time and analyze how each alteration affects the stability of the resulting method. Our analysis shows that the variant requiring only one synchronization per block column cannot be guaranteed to be stable in practice, as stability begins to degrade with the first reduction of synchronization points. Our analysis of block methods also provides new theoretical results for the single-column case. In particular, it is proven that DCGS2 from [Bielich, D. et al., Par. Comput. 112 (2022)] and CGS-2 from [Świrydowicz, K. et al., Num. Lin. Alg. Appl. 28 (2021)] are as stable as Householder QR. Numerical examples from the BlockStab toolbox are included throughout, to help compare variants and illustrate the effects of different choices of intraorthogonalization subroutines.
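To make the abstract's setting concrete, the following is a minimal NumPy sketch of a reorthogonalized block classical Gram-Schmidt loop with four synchronization points per block column (two block projections and two intrablock QR calls). It is an illustration under stated assumptions, not the paper's algorithms or the BlockStab code: the function names (`house_qr`, `chol_qr`, `bcgs_reorth`, `loss_of_orthogonality`), the choice of Cholesky QR as the "less stable" intrablock routine, and the test matrix are all hypothetical. The first block uses Householder QR (via `numpy.linalg.qr`) as the "strong" routine; later blocks use the weaker one.

```python
import numpy as np


def house_qr(X):
    """'Strong' intrablock routine: reduced Householder QR from LAPACK."""
    return np.linalg.qr(X)


def chol_qr(X):
    """Weaker intrablock routine (assumed for illustration): Cholesky QR."""
    R = np.linalg.cholesky(X.T @ X).T      # upper triangular, X^T X = R^T R
    Q = np.linalg.solve(R.T, X.T).T        # Q = X R^{-1}
    return Q, R


def bcgs_reorth(X, s, io_first=house_qr, io_rest=chol_qr):
    """Reorthogonalized block classical Gram-Schmidt with block size s.

    Each block column after the first costs four synchronizations:
    projection, intrablock QR, second projection, second intrablock QR.
    """
    m, n = X.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    Q[:, :s], R[:s, :s] = io_first(X[:, :s])   # strong step, first block only
    for k in range(s, n, s):
        Qp = Q[:, :k]                          # previously computed basis
        Xk = X[:, k:k + s]
        S1 = Qp.T @ Xk                         # sync 1: first projection
        W, R1 = io_rest(Xk - Qp @ S1)          # sync 2: first intrablock QR
        S2 = Qp.T @ W                          # sync 3: reorthogonalization
        Qk, R2 = io_rest(W - Qp @ S2)          # sync 4: second intrablock QR
        Q[:, k:k + s] = Qk
        # Xk = Qp (S1 + S2 R1) + Qk (R2 R1)
        R[:k, k:k + s] = S1 + S2 @ R1
        R[k:k + s, k:k + s] = R2 @ R1
    return Q, R


def loss_of_orthogonality(Q):
    """Spectral norm of I - Q^T Q."""
    return np.linalg.norm(np.eye(Q.shape[1]) - Q.T @ Q, 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, s = 200, 20, 5
    # Hypothetical test matrix with condition number about 1e4.
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    X = U @ np.diag(np.logspace(0, -4, n)) @ V.T
    Q, R = bcgs_reorth(X, s)
    print("loss of orthogonality:", loss_of_orthogonality(Q))
    print("relative residual:", np.linalg.norm(X - Q @ R) / np.linalg.norm(X))
```

Swapping `io_first` or `io_rest` for other routines gives a simple way to experiment with the trade-off the abstract describes; note that Cholesky QR can fail outright when the block is too ill-conditioned, so the moderate condition number in the demo is a deliberate choice.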

Paper Structure (12 sections, 18 theorems, 176 equations, 11 figures, 3 tables, 5 algorithms)

Key Result

Lemma 1

Assume that $\bar{\bm{G}}$, $\tilde{\bm{G}}$, $\bar{\bm{Q}}$, $\bar{R}$, and $\bar{\bm{\mathcal{Q}}}_{prev}$ satisfy eq:epsproj, eq:epsqr, and eq:epsQkp, and that is satisfied. Furthermore, assume that $\bar{R}$ is nonsingular. Then and $\bar{\bm{\mathcal{Q}}}_{new} = [\bar{\bm{\mathcal{Q}}}_{prev}, \bar{\bm{Q}}]$ satisfies

Figures (11)

  • Figure 1: Comparison among \ref{alg:BCGSA}, \ref{alg:BCGSA}, and \ref{alg:BCGSIROA} on a class of monomial matrices from BlockStab.
  • Figure 2: Comparison between \ref{alg:BCGSIROA} (i.e., Algorithm \ref{alg:BCGSIROA} with all IOs equal) and \ref{alg:BCGSIROA} ($\texttt{IO}_{\mathrm{A}} = \texttt{HouseQR}$ and $\texttt{IO}_1 = \texttt{IO}_2$) on a class of monomial matrices.
  • Figure 3: Comparison between \ref{alg:BCGSIROA} (i.e., Algorithm \ref{alg:BCGSIROA} with all IOs equal) and \ref{alg:BCGSIROA} ($\texttt{IO}_{\mathrm{A}} = \texttt{HouseQR}$ and $\texttt{IO}_1 = \texttt{IO}_2$) on a class of piled matrices.
  • Figure 4: Comparison between \ref{alg:BCGSIROA} and \ref{alg:BCGSIROA3S} on a class of monomial matrices. Note that $\texttt{IO}_{\mathrm{A}}$ is fixed as HouseQR, and $\texttt{IO} = \texttt{IO}_1 = \texttt{IO}_2$.
  • Figure 5: Comparison among low-sync versions of \ref{alg:BCGSIROA} on a class of monomial matrices. Note that $\texttt{IO}_{\mathrm{A}}$ is fixed as HouseQR.
  • ...and 6 more figures

Theorems & Definitions (33)

  • Lemma 1
  • Lemma 2
  • proof
  • proof: Proof of Lemma \ref{lem:epsQk-relation}
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • Theorem 1
  • ...and 23 more