QR factorization of ill-conditioned tall-and-skinny matrices on distributed-memory systems
Nenad Mijić, Abhiram Kaushik, Davor Davidović
TL;DR
This work tackles QR factorization for extremely ill-conditioned tall-and-skinny matrices on distributed-memory and multi-GPU systems. It introduces Modified CholeskyQR2 with Gram-Schmidt (mCQR2GS), a distributed algorithm that interleaves CholeskyQR steps with Gram-Schmidt re-orthogonalisation to achieve $O(u)$-level orthogonality even for matrices with $\text{cond}(A)$ up to approximately $10^{16}$. The approach builds on CholeskyQR2 and Shifted CholeskyQR3, adding a robust panel-based strategy that reduces communication and improves stability, outperforming ScaLAPACK by up to $6\times$ on CPUs and $80\times$ on GPUs in weak scaling. The paper provides a detailed scalability analysis, discusses the trade-offs of panel width and paneling strategies, and outlines future enhancements (look-ahead, adaptive paneling, shifting) with public code availability on GitHub and Zenodo.
Abstract
In this paper we present a novel algorithm for computing the QR factorisation of extremely ill-conditioned tall-and-skinny matrices on distributed-memory systems. The algorithm is based on the communication-avoiding CholeskyQR2 algorithm and its block Gram-Schmidt variant. The latter improves the numerical stability of CholeskyQR2 and significantly reduces the loss of orthogonality, even for matrices with condition numbers up to $10^{15}$. Currently, no distributed GPU version of this algorithm is available in the literature, which prevents the application of the method to very large matrices. In our work we provide a distributed implementation of this algorithm and also introduce a modified version that improves the performance, especially for extremely ill-conditioned matrices. The main innovation of our approach lies in interleaving the CholeskyQR steps with the Gram-Schmidt orthogonalisation, which ensures that update steps are performed with fully orthogonalised panels. The orthogonality and numerical stability obtained with our modified algorithm are equivalent to those of CholeskyQR2 with Gram-Schmidt and other state-of-the-art methods. Weak scaling tests performed with our test matrices show significant performance improvements. In particular, our algorithm outperforms state-of-the-art Householder-based QR factorisation algorithms available in ScaLAPACK by a factor of $6$ on CPU-only systems and up to $80\times$ on GPU-based systems with distributed memory.
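To make the interleaving idea concrete, the following is a minimal single-node NumPy sketch of a panel-based CholeskyQR scheme with Gram-Schmidt re-orthogonalisation. It is an illustration under stated assumptions, not the authors' distributed implementation: the function names (`cholesky_qr`, `panel_cholqr_gs`, `make_test_matrix`), the panel width, and the exact ordering of the steps are hypothetical choices based only on the description in the abstract. In a distributed setting, the Gram matrix and the projection products would presumably be formed with a global reduction rather than a local matrix product. The example uses a moderately ill-conditioned matrix; for much larger condition numbers the plain Cholesky call in this sketch may fail.

```python
import numpy as np


def cholesky_qr(W):
    """One CholeskyQR pass: W = Q R, with R taken from the Cholesky factor of W^T W."""
    G = W.T @ W                      # Gram matrix (a global reduction in a distributed run)
    L = np.linalg.cholesky(G)        # G = L L^T, hence R = L^T
    Q = np.linalg.solve(L, W.T).T    # Q = W R^{-1}
    return Q, L.T


def panel_cholqr_gs(A, panel_width):
    """Panel-wise QR sketch: CholeskyQR, re-orthogonalise, CholeskyQR, then update.

    Assumption: this ordering is one plausible reading of "interleaving",
    not necessarily the ordering used in the paper.
    """
    m, n = A.shape
    A = A.astype(float)              # astype returns a copy, so the input is untouched
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for start in range(0, n, panel_width):
        stop = min(start + panel_width, n)
        # First CholeskyQR pass on the current (already projected) panel.
        Q1, R1 = cholesky_qr(A[:, start:stop])
        # Gram-Schmidt re-orthogonalisation against all finished panels.
        S = Q[:, :start].T @ Q1
        Q1 = Q1 - Q[:, :start] @ S
        # Second CholeskyQR pass restores orthonormality within the panel.
        Q2, R2 = cholesky_qr(Q1)
        Q[:, start:stop] = Q2
        R[:start, start:stop] += S @ R1
        R[start:stop, start:stop] = R2 @ R1
        # Trailing update, performed with the fully orthogonalised panel.
        Y = Q2.T @ A[:, stop:]
        A[:, stop:] -= Q2 @ Y
        R[start:stop, stop:] = Y
    return Q, R


def make_test_matrix(m, n, cond, seed=0):
    """Random m x n matrix with singular values spaced logarithmically from 1 to 1/cond."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = np.logspace(0, -np.log10(cond), n)
    return (U * s) @ V.T


A = make_test_matrix(1000, 32, cond=1e6)
Q, R = panel_cholqr_gs(A, panel_width=8)
print("orthogonality error:", np.linalg.norm(Q.T @ Q - np.eye(32)))
print("relative residual:  ", np.linalg.norm(A - Q @ R) / np.linalg.norm(A))
```

The two printed quantities, $\|Q^T Q - I\|$ and $\|A - QR\| / \|A\|$, are the usual measures of loss of orthogonality and of backward error. The panel width trades the conditioning of each panel's Gram matrix against the number of reduction steps, which is the trade-off the TL;DR refers to.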
