Solving Large Rank-Deficient Linear Least-Squares Problems on Shared-Memory CPU Architectures and GPU Architectures

Mónica Chillarón, Gregorio Quintana-Ortí, Vicente Vidal, Per-Gunnar Martinsson

TL;DR

The paper addresses very large, possibly rank-deficient linear least-squares (LS) problems whose coefficient matrix does not fit in main memory. Instead of the SVD or column-pivoted QR (CPQR), it builds the solvers on the randUTV factorization, a rank-revealing complete orthogonal decomposition that suits out-of-core (OOC) execution and GPUs. Precision tests and performance experiments show accuracy competitive with state-of-the-art in-core methods, together with performance gains from blocking, algorithms-by-blocks, and optimized OOC implementations. The authors provide CPU and GPU implementations with dedicated data-management strategies. Overall, the randUTV-based LS solvers scale to problems stored on disk while remaining competitive in precision and speed with in-core solvers.

Abstract

Solving very large linear systems of equations is a key computational task in science and technology. In many cases, the coefficient matrix of the linear system is rank-deficient, leading to systems that may be underdetermined, inconsistent, or both. In such cases, one generally seeks to compute the least squares solution that minimizes the residual of the problem, which can be further defined as the solution with smallest norm in cases where the coefficient matrix has a nontrivial nullspace. This work presents several new techniques for solving least squares problems involving coefficient matrices that are so large that they do not fit in main memory. The implementations include both CPU and GPU variants. All techniques rely on complete orthogonal decompositions that guarantee that both conditions of a least squares solution are met, regardless of the rank properties of the matrix. Specifically, they rely on the recently proposed "randUTV" algorithm, which is particularly effective in strongly communication-constrained environments. A detailed precision and performance study reveals that the new methods, which operate on data stored on disk, are competitive with state-of-the-art methods that store all data in main memory.
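
The two conditions referred to in the abstract can be stated explicitly. What follows is the standard textbook formulation, given as a sketch; the symbols $x_\star$, $\mathcal{S}$, $L$, $W$, $U_1$, and $W_1$ are notation assumed for this note and are not taken from the paper.

$$
x_\star = \arg\min_{x \in \mathcal{S}} \|x\|_2, \qquad \mathcal{S} = \{\, x : \|Ax - b\|_2 = \min_{z} \|Az - b\|_2 \,\}.
$$

If a complete orthogonal decomposition $A = U \begin{pmatrix} L & 0 \\ 0 & 0 \end{pmatrix} W^T$ is available, with $U$ and $W$ orthogonal and $L$ a nonsingular $r \times r$ block, then

$$
x_\star = W_1 L^{-1} U_1^T b,
$$

where $U_1$ and $W_1$ hold the first $r$ columns of $U$ and $W$. The residual condition is met because $U_1 U_1^T b$ is the projection of $b$ onto the range of $A$, and the minimum-norm condition is met because $x_\star$ has no component in the nullspace of $A$.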

Paper Structure

This paper contains 30 sections, 9 equations, 12 figures, and 12 tables.

Figures (12)

  • Figure 1: The sparsity patterns of the four matrices $T^{(i)}$ that appear in randUTV, shown for the particular case where $m=11, n=8$, and $n_b=3$. The size of each element is a rough approximation of its absolute value.
  • Figure 2: The randUTV algorithm written with the FLAME methodology/notation.
  • Figure 3: The algorithm to nullify the top right part of $T$ written with the FLAME methodology/notation. The input arguments are the following: $T$ is the upper triangular factor of the randUTV factorization of $A$, $V$ is the right orthogonal matrix of the randUTV factorization, and $r$ is the numeric rank of $T$.
  • Figure 4: The overall algorithm for solving $Ax=b$. The input arguments are the following: $A$ is the coefficient matrix of dimension $m \times n$, $b$ is the independent vector of dimension $m \times 1$, and $q$ is the number of steps in the power iteration process.
  • Figure 5: An illustration of the first tasks performed by a blocked algorithm for computing the QR factorization. The '$\bullet$' symbol represents an element not modified by the current task, '$\star$' represents an element modified by the current task, and '$\circ$' represents a nullified element. The continuous lines surround the blocks involved in the current task.
  • ...and 7 more figures
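
Figures 3 and 4 describe two steps: nullifying the top right part of $T$, and the overall solve. As a rough illustration of how these steps fit together, the sketch below solves a small rank-deficient problem in NumPy. It is not the paper's implementation. The factorization is a small, unblocked, in-core randomized UTV-type stand-in for randUTV. The function names `randomized_utv` and `min_norm_lstsq`, the tolerance `tol`, and the rank estimate taken from the diagonal of $T$ are all assumptions made for this example. The sketch also assumes $m \ge n$.

```python
import numpy as np


def randomized_utv(A, q=1, seed=0):
    """Unblocked randomized UTV-type factorization A = U @ T @ V.T.

    Illustrative stand-in only (NOT the blocked randUTV of the paper).
    q is the number of power-iteration steps; assumes m >= n.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # Start from a random orthogonal matrix.
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    # Power iteration aligns the leading columns of V with the row space of A.
    for _ in range(q):
        V, _ = np.linalg.qr(A.T @ (A @ V))
    # Unpivoted QR of A V yields U (m x n) and upper triangular T (n x n).
    U, T = np.linalg.qr(A @ V)
    return U, T, V


def min_norm_lstsq(A, b, q=1, tol=1e-10):
    """Minimum-norm least-squares solution through a UTV-type factorization."""
    U, T, V = randomized_utv(A, q)
    # Assumed rank estimate: diagonal entries of T above a relative threshold.
    d = np.abs(np.diag(T))
    r = int(np.sum(d > tol * d[0]))
    # Nullify the top right part of T:
    # T[:r, :].T = Z @ R  implies  T[:r, :] = R.T @ Z.T, with R.T lower triangular.
    Z, R = np.linalg.qr(T[:r, :].T)
    # Solve the r x r system, then map back through both orthogonal factors.
    y = np.linalg.solve(R.T, U[:, :r].T @ b)
    x = V @ (Z @ y)
    return x, r


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # 8 x 5 test matrix of exact rank 3.
    A = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 5))
    b = rng.standard_normal(8)
    x, r = min_norm_lstsq(A, b)
    x_ref = np.linalg.lstsq(A, b, rcond=1e-10)[0]
    print("estimated rank:", r)
    print("max deviation from lstsq:", np.max(np.abs(x - x_ref)))
```

On a matrix of exact low rank, the result should agree with the SVD-based `np.linalg.lstsq` up to rounding. The second QR step is what turns the upper triangular factor into a complete orthogonal decomposition, so the computed solution has no nullspace component.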

Theorems & Definitions (2)

  • Remark 1: Connection to randomized SVD
  • Remark 2: Fast option