QR factorization of ill-conditioned tall-and-skinny matrices on distributed-memory systems

Nenad Mijić; Abhiram Kaushik; Davor Davidović

TL;DR

This work tackles QR factorization for extremely ill-conditioned tall-and-skinny matrices on distributed-memory and multi-GPU systems. It introduces Modified CholeskyQR2 with Gram-Schmidt (mCQR2GS), a distributed algorithm that interleaves CholeskyQR steps with Gram-Schmidt re-orthogonalisation to achieve $O(u)$-level orthogonality even for matrices with $\text{cond}(A)$ up to approximately $10^{16}$. The approach builds on CholeskyQR2 and Shifted CholeskyQR3, adding a robust panel-based strategy that reduces communication and improves stability, outperforming ScaLAPACK by up to $6\times$ on CPUs and $80\times$ on GPUs in weak scaling. The paper provides a detailed scalability analysis, discusses the trade-offs of panel width and paneling strategies, and outlines future enhancements (look-ahead, adaptive paneling, shifting), with the code publicly available on GitHub and Zenodo.

Abstract

In this paper we present a novel algorithm for computing the QR factorisation of extremely ill-conditioned tall-and-skinny matrices on distributed-memory systems. The algorithm is based on the communication-avoiding CholeskyQR2 algorithm and its block Gram-Schmidt variant. The latter improves the numerical stability of CholeskyQR2 and significantly reduces the loss of orthogonality, even for matrices with condition numbers up to $10^{15}$. Currently, no distributed GPU version of this variant is available in the literature, which prevents the method from being applied to very large matrices. In our work we provide a distributed implementation of this algorithm and also introduce a modified version that improves performance, especially for extremely ill-conditioned matrices. The main innovation of our approach lies in interleaving the CholeskyQR steps with the Gram-Schmidt orthogonalisation, which ensures that update steps are performed with fully orthogonalised panels. The orthogonality and numerical stability obtained with our modified algorithm are equivalent to those of CholeskyQR2 with Gram-Schmidt and other state-of-the-art methods. Weak scaling tests performed with our test matrices show significant performance improvements. In particular, our algorithm outperforms the state-of-the-art Householder-based QR factorisation algorithms available in ScaLAPACK by a factor of $6$ on CPU-only systems and by up to $80\times$ on GPU-based systems with distributed memory.
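To make the building blocks named in the abstract concrete, the following is a minimal single-node NumPy sketch, not the paper's distributed implementation. It shows one CholeskyQR pass, the two-pass CholeskyQR2, and a generic block Gram-Schmidt loop that runs CholeskyQR2 on each column panel and then projects that panel out of the trailing columns. The function names and the `panel_width` parameter are illustrative assumptions, and the panel loop is a textbook block Gram-Schmidt, so it should not be read as the exact CQR2GS or mCQR2GS schedule.

```python
import numpy as np


def cholesky_qr(A):
    """One CholeskyQR pass: R is the Cholesky factor of the Gram matrix A^T A."""
    G = A.T @ A                      # n x n Gram matrix; in a distributed setting this is the reduction step
    R = np.linalg.cholesky(G).T      # upper-triangular R with G = R^T R
    Q = np.linalg.solve(R.T, A.T).T  # Q = A R^{-1}, written as a solve instead of an explicit inverse
    return Q, R


def cholesky_qr2(A):
    """CholeskyQR2: a second pass repairs the orthogonality lost in the first."""
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1


def block_gs_cholesky_qr2(A, panel_width):
    """Generic block Gram-Schmidt over column panels (illustrative, hypothetical name)."""
    n = A.shape[1]
    W = A.astype(float, copy=True)   # working copy; trailing panels are updated in place
    Q = np.empty_like(W)
    R = np.zeros((n, n))
    for start in range(0, n, panel_width):
        stop = min(start + panel_width, n)
        # Orthogonalise the current panel with CholeskyQR2.
        Q[:, start:stop], R[start:stop, start:stop] = cholesky_qr2(W[:, start:stop])
        if stop < n:
            # Remove the components along the finished panel from the trailing columns.
            Y = Q[:, start:stop].T @ W[:, stop:]
            R[start:stop, stop:] = Y
            W[:, stop:] -= Q[:, start:stop] @ Y
    return Q, R
```

The loss of orthogonality can then be checked as `np.linalg.norm(Q.T @ Q - np.eye(n))` and the residual as `np.linalg.norm(Q @ R - A) / np.linalg.norm(A)`. Note that `np.linalg.cholesky` raises `LinAlgError` when the Gram matrix is numerically not positive definite, which is the failure mode that motivates the shifted and panel-based variants for ill-conditioned inputs.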

Paper Structure (13 sections, 6 equations, 10 figures, 2 tables, 9 algorithms)

Introduction
Testing environment
Testing platform
Test matrix suite
Software
CholeskyQR
CholeskyQR2
CholeskyQR variants for extremely ill-conditioned matrices
Shifted CholeskyQR3
CholeskyQR2 with Gram-Schmidt
Modified CholeskyQR2 with Gram-Schmidt
Scalability analysis
Conclusion

Figures (10)

Figure 1: Orthogonality and residuals of sCQR3 and CQR2 as a function of the condition number, for input matrices with $m=30000$, $n=3000$ and conservative shift for sCQR3.
Figure 2: Distributing and slicing of a matrix. An example with 2 processors. The assignments of blocks and panels with processors are indicated on the vertical axis.
Figure 3: CQR2GS: Orthogonality of $Q$ as a function of panel size, for ill-conditioned input matrices with $m=30000$, $n=3000$.
Figure 4: Time to solution of CQR2GS on 4 GPUs as a function of panel size for well-conditioned input matrices ($\kappa(A) = 10^4$) with $m=\{30000, 300000\}$ and a fixed number of columns $n=3000$.
Figure 5: Graphical overview of matrix distribution on 4 ranks and computational operations on local matrix data.
...and 5 more figures
