CholeskyQR with Randomization and Pivoting for Tall Matrices (CQRRPT)

Maksim Melnichenko, Oleg Balabanov, Riley Murray, James Demmel, Michael W. Mahoney, Piotr Luszczek

TL;DR

CQRRPT addresses the high cost of pivoting in QR with column pivoting (QRCP) for tall matrices. It makes pivot decisions on a single randomized sketch of the input, then applies CholeskyQR to the pivoted matrix after randomized preconditioning. The method yields a pivoted QR decomposition that preserves rank-revealing and stability properties under standard RandNLA assumptions, with a leading arithmetic cost of $3m n^{2}$ (plus sketch terms) and favorable communication characteristics. The authors provide rigorous RRQR and stability results, discuss practical rank estimation, and demonstrate substantial speedups over LAPACK's GEQP3 on large tall matrices while maintaining explicit $\mathbf{Q}$ factors. The work positions CQRRPT as a robust, scalable tool for orthogonalization in high-performance computing, with an open-source implementation in RandLAPACK and directions for future sketching improvements and GPU/low-precision extensions.

Abstract

This paper develops and analyzes a new algorithm for QR decomposition with column pivoting (QRCP) of rectangular matrices with many more rows than columns. The algorithm carefully combines methods from randomized numerical linear algebra to accelerate pivot decisions for the input matrix and the process of decomposing the pivoted matrix into the QR form. The source of the latter improvement is CholeskyQR with randomized preconditioning. Comprehensive analysis is provided in both exact and finite-precision arithmetic to characterize the algorithm's rank-revealing properties and its numerical stability granted probabilistic assumptions of the sketching operator. An implementation of the proposed algorithm is described and made available inside the open-source RandLAPACK library, which itself relies on RandBLAS. Experiments with this implementation on an Intel Xeon Gold 6248R CPU demonstrate order-of-magnitude speedups over LAPACK's standard function for QRCP, and comparable performance to a specialized algorithm for unpivoted QR of tall matrices, which lacks the strong rank-revealing properties of the proposed method.
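
The pipeline described above (sketch, pivot on the sketch, precondition, CholeskyQR) can be illustrated with a minimal NumPy sketch. This is not the RandLAPACK implementation. The function names `greedy_pivots` and `cqrrpt_sketch`, the dense Gaussian sketching operator, the naive rank threshold, and the use of `np.linalg.solve` in place of triangular solves are all assumptions made for illustration. The default `gamma=1.25` is taken from the sampling-factor figure caption.

```python
import numpy as np


def greedy_pivots(B):
    """Column order from greedy largest-residual-norm pivoting (the QRCP rule)."""
    B = np.array(B, dtype=float)  # work on a copy
    n = B.shape[1]
    perm = np.arange(n)
    for k in range(n):
        norms = np.linalg.norm(B[:, k:], axis=0)
        j = k + int(np.argmax(norms))
        if norms[j - k] == 0.0:
            break  # remaining columns are exactly zero
        B[:, [k, j]] = B[:, [j, k]]
        perm[[k, j]] = perm[[j, k]]
        q = B[:, k] / norms[j - k]
        # Remove the component along the chosen column from the trailing columns.
        B[:, k + 1:] -= np.outer(q, q @ B[:, k + 1:])
    return perm


def cqrrpt_sketch(A, gamma=1.25, seed=0):
    """Illustrative pivoted QR of a tall matrix A; returns Q, R, J with A[:, J] ~ Q @ R."""
    m, n = A.shape
    d = int(np.ceil(gamma * n))  # embedding dimension d = gamma * n
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((d, m)) / np.sqrt(d)  # assumed dense Gaussian sketch

    A_sk = S @ A                               # small d x n sketch
    J = greedy_pivots(A_sk)                    # pivots chosen on the sketch only
    R_sk = np.linalg.qr(A_sk[:, J], mode="r")  # triangular factor of the pivoted sketch

    # Naive numerical-rank estimate from the diagonal of R_sk (an assumption,
    # not the rank-estimation strategy discussed in the paper).
    diag = np.abs(np.diag(R_sk))
    k = int(np.sum(diag > n * np.finfo(float).eps * diag[0]))

    # Precondition: A_pre = A[:, J[:k]] @ inv(R_sk[:k, :k]).
    A_pre = np.linalg.solve(R_sk[:k, :k].T, A[:, J[:k]].T).T

    # CholeskyQR on the preconditioned matrix: A_pre.T @ A_pre = L @ L.T.
    L = np.linalg.cholesky(A_pre.T @ A_pre)
    Q = np.linalg.solve(L, A_pre.T).T          # Q = A_pre @ inv(L.T)
    R = L.T @ R_sk[:k, :]                      # undo the preconditioning in R
    return Q, R, J


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((2000, 20))
    Q, R, J = cqrrpt_sketch(A)
    print("reconstruction error:", np.linalg.norm(A[:, J] - Q @ R))
    print("orthogonality error: ", np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1])))
```

The only work on the full $m \times n$ matrix is the sketch, one triangular solve, one Gram matrix, and one more triangular solve. All pivoting happens on the small $d \times n$ sketch.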

Paper Structure

This paper contains 56 sections, 14 theorems, 67 equations, 8 figures, 2 tables, 2 algorithms.

Key Result

Theorem 2

If $\kappa(\bm{\mathsf{S}}|\operatorname{range}(\bm{\mathsf{M}}))$ is finite, then $[\bm{\mathsf{Q}}_k, \bm{\mathsf{R}}_k, J] = \texttt{cqrrpt\_core}(\bm{\mathsf{M}},\bm{\mathsf{S}})$ define a column-pivoted QR decomposition of $\bm{\mathsf{M}}$ in the sense of the paper's QRCP definition (label def:qrcp_alg).
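
The hypothesis of the theorem can be checked numerically. The sketch below assumes that the restricted condition number $\kappa(\bm{\mathsf{S}}|\operatorname{range}(\bm{\mathsf{M}}))$ is the ordinary condition number of $\bm{\mathsf{S}}\bm{\mathsf{U}}$, where the columns of $\bm{\mathsf{U}}$ form an orthonormal basis of $\operatorname{range}(\bm{\mathsf{M}})$; that reading of the notation, the helper name `restricted_cond_demo`, and the Gaussian test data are all illustrative assumptions. For a full-rank $\bm{\mathsf{M}}$ it also shows the standard sketch-and-precondition fact that the preconditioned matrix inherits this condition number, however ill-conditioned $\bm{\mathsf{M}}$ itself is.

```python
import numpy as np


def restricted_cond_demo(m=1000, n=10, d=15, seed=0):
    """Return (cond(M), cond(S @ U), cond(M_pre)) for an ill-conditioned test matrix M."""
    rng = np.random.default_rng(seed)
    # Tall full-rank matrix with column scales spanning six orders of magnitude.
    M = rng.standard_normal((m, n)) @ np.diag(np.logspace(0, -6, n))
    S = rng.standard_normal((d, m)) / np.sqrt(d)  # assumed Gaussian sketching operator

    U, _ = np.linalg.qr(M)                 # orthonormal basis of range(M)
    kappa_restricted = np.linalg.cond(S @ U)

    # Precondition M with the triangular factor of its sketch (pivoting is omitted,
    # since permuting columns does not change range(M)).
    R_sk = np.linalg.qr(S @ M, mode="r")
    M_pre = np.linalg.solve(R_sk.T, M.T).T  # M_pre = M @ inv(R_sk)
    return np.linalg.cond(M), kappa_restricted, np.linalg.cond(M_pre)


if __name__ == "__main__":
    cond_M, kappa, cond_pre = restricted_cond_demo()
    print("cond(M)      =", cond_M)
    print("cond(S @ U)  =", kappa)
    print("cond(M_pre)  =", cond_pre)
```

A finite restricted condition number means that $\bm{\mathsf{S}}$ does not annihilate any direction in $\operatorname{range}(\bm{\mathsf{M}})$, which is what keeps the Cholesky step on the preconditioned matrix well defined.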

Figures (8)

  • Figure 1: Pivot quality results for low-coherence matrices with two types of spectral decay.
  • Figure 2: Pivot quality results for a high-coherence matrix under two choices of sketching distribution parameters.
  • Figure 3: Performance comparison of QR schemes for matrices with a fixed number of rows ($2^{16} = 65536$ and $2^{17} = 131072$, respectively) and a varying number of columns ($512, \ldots, 8192$). With $131072$ rows, CQRRPT remains the fastest algorithm up until $n = 8192$; at that point, operations on submatrices in underlying LAPACK routines no longer fit in the cache.
  • Figure 4: Percentage of CQRRPT's runtime spent in each of its subroutines. The cost of sketching becomes negligible for larger matrices. When $d \geq n$, the cost of applying QRCP to the $d \times n$ sketch grows as $\Omega(n^3)$, whereas the cost of applying CholeskyQR to the $m \times n$ preconditioned matrix grows as $\mathcal{O}(m n^2)$. It is therefore reasonable that QRCP consumes a larger fraction of runtime as $n$ increases.
  • Figure 5: Effect of varying the sampling factor $\gamma \in \{1, 1.5, 2, \ldots, 4\}$ for matrices of sizes $131072 \times \{1024, 2048, 4096\}$. Runtime is the wall-clock time for the full execution of CQRRPT. Increasing the embedding dimension has a larger effect on wider matrices, because QRCP becomes more expensive as the number of columns grows (see the figure labeled 'CQRRPT Inner Speed Fig 1'). The default value $\gamma = 1.25$ is marked with an X; performance in this case can be inferred by interpolating between the first two data points of each series.
  • ...and 3 more figures
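
The scaling argument in the captions for Figures 4 and 5 can be made concrete with a rough flop model. The sketch below assumes the textbook Householder count $2dn^2 - \tfrac{2}{3}n^3$ for QR of the $d \times n$ sketch with $d = \gamma n$, and uses the $3mn^2$ leading term quoted in the TL;DR for the work on the full matrix. The function names are hypothetical, and the model ignores sketching cost, memory traffic, and the lower flop rate of pivoted QR, so it may understate QRCP's share of wall-clock time.

```python
import math


def qrcp_flops(d, n):
    """Textbook Householder QR count for a d x n matrix with d >= n."""
    return 2 * d * n**2 - (2 / 3) * n**3


def tall_stage_flops(m, n):
    """Leading-order 3 m n^2 term for the operations on the m x n matrix."""
    return 3 * m * n**2


def qrcp_share(m, n, gamma=1.25):
    """Fraction of modeled flops spent on QRCP of the sketch (d = ceil(gamma * n))."""
    d = math.ceil(gamma * n)
    sketch_qr = qrcp_flops(d, n)
    return sketch_qr / (sketch_qr + tall_stage_flops(m, n))


if __name__ == "__main__":
    m = 131072
    for n in (512, 1024, 2048, 4096, 8192):
        print(f"n = {n:5d}  modeled QRCP share = {qrcp_share(m, n):.4f}")
    for gamma in (1.0, 1.25, 2.0, 4.0):
        print(f"gamma = {gamma:4.2f}  modeled QRCP share at n = 4096: "
              f"{qrcp_share(m, 4096, gamma):.4f}")
```

In this model the QRCP term scales as $n^3$ while the tall-matrix term scales as $mn^2$, so with $m$ fixed the modeled share grows with both $n$ and $\gamma$, matching the qualitative trend the captions describe.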

Theorems & Definitions (28)

  • Remark 1
  • Definition 1
  • Remark 2
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • Theorem 5
  • Definition 6
  • ...and 18 more