Fast Truncated SVD of Sparse and Dense Matrices on Graphics Processors

Andres E. Tomas; Enrique S. Quintana-Orti; Hartwig Anzt

TL;DR

The paper addresses the efficient computation of low-rank matrix approximations via the truncated SVD on GPUs. It develops GPU-optimized implementations of the randomized SVD (RandSVD) and of a blocked Lanczos-based SVD (LancSVD). Both methods are expressed through common linear-algebra building blocks (matrix multiplications, orthogonalizations, and small SVDs) that map to GPU kernels, and the paper gives a detailed cost and performance analysis for sparse and dense matrices. Experiments on real-world sparse matrices from the SuiteSparse collection and on synthetic dense matrices show that LancSVD typically outperforms RandSVD at the same target accuracy, mainly because it converges faster and because the transposed SpMM is expensive on GPUs. The findings suggest that GPU-accelerated building blocks, particularly for SpMM and orthogonalization, make the truncated SVD practical and scalable for large sparse and dense problems. As future work, the paper points to improved restart strategies that would further reduce the total number of multiplications.
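
To make the building-block view concrete, the following is a minimal CPU sketch of a generic randomized SVD with subspace iteration, written with NumPy. It is not the authors' GPU implementation. The function name and the parameters `rank`, `oversample`, and `n_iter` are illustrative assumptions and need not correspond to the paper's $r$, $p$, or $b$. `np.linalg.qr` stands in for the orthogonalization kernels that the paper maps to the GPU.

```python
import numpy as np

def randomized_svd(A, rank, oversample=8, n_iter=2, seed=0):
    """Illustrative randomized SVD (hypothetical helper, not the paper's code).

    Returns approximations to the leading `rank` singular triplets of A.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    ell = min(rank + oversample, m, n)  # width of the sketch

    # Building block 1: matrix multiplication with a random test matrix.
    # Building block 2: orthogonalization (plain QR here, as a stand-in).
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, ell)))

    # Subspace iteration: alternate products with A^T and A,
    # re-orthogonalizing after each product for numerical stability.
    for _ in range(n_iter):
        Z, _ = np.linalg.qr(A.T @ Q)  # product with the transposed matrix
        Q, _ = np.linalg.qr(A @ Z)

    # Building block 3: SVD of the small projected matrix (ell x n).
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)

    # Map the left singular vectors back to the original space and truncate.
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank]
```

Each pass of the loop needs one product with `A.T`. For a sparse `A`, this is the transposed SpMM that the summary above identifies as costly on GPUs.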

Abstract

We investigate the solution of low-rank matrix approximation problems using the truncated SVD. For this purpose, we develop and optimize GPU implementations for the randomized SVD and a blocked variant of the Lanczos approach. Our work takes advantage of the fact that the two methods are composed of very similar linear algebra building blocks, which can be assembled using numerical kernels from existing high-performance linear algebra libraries. Furthermore, the experiments with several sparse matrices arising in representative real-world applications and synthetic dense test matrices reveal a performance advantage of the block Lanczos algorithm when targeting the same approximation accuracy.

Paper Structure (23 sections, 16 equations, 4 figures, 2 tables, 5 algorithms)

Introduction
Truncated SVD
Randomized SVD
Overview.
Building blocks.
Role of the parameters $p$ and $r$.
Block Lanczos SVD
Overview.
Building blocks.
Role of the parameter $b$.
Role of the parameter $r$.
Role of the parameter $p$.
Building Blocks on GPUs
QR factorization via block Gram-Schmidt
Orthogonalization via CholeskyQR2
...and 8 more sections

Figures (4)

Figure 1: Relative residuals ${\cal R}_{1}$ (top) and ${\cal R}_{10}$ (bottom) for the solutions computed with RandSVD and LancSVD and different values of $r$ and $p$. In all cases, $b=16$.
Figure 2: Execution time of LancSVD and RandSVD (top and middle, respectively) and speed-up of LancSVD with respect to RandSVD (bottom).
Figure 3: Distribution of the flops across the major building blocks in LancSVD and RandSVD (top and bottom, respectively).
Figure 4: Relative residuals ${\cal R}_{1}$ to ${\cal R}_{10}$ for the solutions computed with LancSVD and RandSVD and different values of $r$ and $p$ (top) and execution time (bottom). In all cases, $b=16$.
