Solving Large Rank-Deficient Linear Least-Squares Problems on Shared-Memory CPU Architectures and GPU Architectures
Mónica Chillarón, Gregorio Quintana-Ortí, Vicente Vidal, Per-Gunnar Martinsson
TL;DR
The paper addresses very large, potentially rank-deficient linear least-squares problems whose data do not fit in main memory. In place of traditional SVD or column-pivoted QR (CPQR) approaches, it uses randUTV, which provides a robust, rank-revealing decomposition in out-of-core (OOC) settings and on GPUs. Precision tests and extensive performance experiments show accuracy competitive with state-of-the-art in-core methods, along with substantial performance gains from blocking, algorithm-by-blocks, and optimized OOC implementations. The work delivers CPU and GPU implementations with advanced data-management strategies that scale in practice to large dense or rank-deficient systems. Overall, the randUTV-based least-squares solvers offer a viable, scalable option for very large problems on modern architectures.
Abstract
Solving very large linear systems of equations is a key computational task in science and technology. In many cases, the coefficient matrix of the linear system is rank-deficient, leading to systems that may be underdetermined, inconsistent, or both. In such cases, one generally seeks to compute the least squares solution that minimizes the residual of the problem, which can be further defined as the solution with smallest norm in cases where the coefficient matrix has a nontrivial nullspace. This work presents several new techniques for solving least squares problems involving coefficient matrices that are so large that they do not fit in main memory. The implementations include both CPU and GPU variants. All techniques rely on complete orthogonal decompositions that guarantee that both conditions of a least squares solution are met, regardless of the rank properties of the matrix. Specifically, they rely on the recently proposed "randUTV" algorithm, which is particularly effective in strongly communication-constrained environments. A detailed precision and performance study reveals that the new methods, which operate on data stored on disk, are competitive with state-of-the-art methods that store all data in main memory.
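To make the two conditions in the abstract concrete (minimal residual and, among all minimizers, minimal norm), the following is a minimal in-core sketch of a least-squares solve through a complete orthogonal decomposition A = U T V^T. It is not the authors' method: it builds the decomposition from a column-pivoted QR followed by a second QR (a classical construction) rather than from randUTV, and it keeps everything in main memory. The function name `min_norm_lstsq` and the rank tolerance `rtol` are assumptions made for illustration only.

```python
import numpy as np
from scipy.linalg import qr, solve_triangular


def min_norm_lstsq(A, b, rtol=1e-10):
    """Minimum-norm least-squares solution via a complete orthogonal decomposition.

    Illustrative only: uses column-pivoted QR, not randUTV, and works in-core.
    """
    n = A.shape[1]
    x = np.zeros(n)

    # Step 1: column-pivoted QR, so that A[:, piv] = Q @ R.
    Q, R, piv = qr(A, mode="economic", pivoting=True)

    # Estimate the numerical rank from the diagonal of R (assumed tolerance).
    d = np.abs(np.diag(R))
    if d.size == 0 or d[0] == 0.0:
        return x  # zero matrix: the minimum-norm solution is x = 0
    rank = int(np.sum(d > rtol * d[0]))

    # Step 2: QR of the leading rows of R, transposed: R[:rank, :].T = Z @ S.
    # Then A[:, piv] is approximately Q[:, :rank] @ S.T @ Z.T,
    # with S.T lower triangular and nonsingular.
    Z, S = np.linalg.qr(R[:rank, :].T)

    # Solve S.T @ z = Q1.T @ b, then map back through Z and the permutation.
    c = Q[:, :rank].T @ b
    z = solve_triangular(S, c, trans="T")
    x[piv] = Z @ z
    return x


# Small rank-deficient example: an 8 x 6 matrix of rank 3.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))
b = rng.standard_normal(8)

x = min_norm_lstsq(A, b)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]  # SVD-based reference solution
print("max difference from reference:", np.max(np.abs(x - x_ref)))
```

Because the solution is assembled only from the range of `Z`, it has no component in the nullspace of A, which is what makes it the minimum-norm minimizer. The rank decision here uses a simple threshold on the diagonal of R; a production solver would need a more careful criterion.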
