Fast Truncated SVD of Sparse and Dense Matrices on Graphics Processors

Andrés E. Tomás, Enrique S. Quintana-Ortí, Hartwig Anzt

TL;DR

The paper addresses efficient computation of low-rank matrix approximations via the truncated SVD on GPUs by developing GPU-optimized implementations of the randomized SVD (RandSVD) and a blocked Lanczos-based SVD (LancSVD). It shows that both methods can be expressed through common linear-algebra building blocks and mapped to GPU kernels (matrix multiplications, orthogonalizations, and small SVDs), and it gives a detailed cost and performance analysis for sparse and dense matrices. Empirical results on real SuiteSparse matrices and synthetic dense matrices show that LancSVD typically outperforms RandSVD at the same target accuracy, mainly because it converges faster and because the transposed SpMM is expensive on GPUs. The findings suggest that GPU-accelerated building blocks, particularly SpMM and orthogonalization, make the truncated SVD practical and scalable for large sparse and dense problems; future work points to improved restart strategies that further reduce the total number of multiplications.

Abstract

We investigate the solution of low-rank matrix approximation problems using the truncated SVD. For this purpose, we develop and optimize GPU implementations for the randomized SVD and a blocked variant of the Lanczos approach. Our work takes advantage of the fact that the two methods are composed of very similar linear algebra building blocks, which can be assembled using numerical kernels from existing high-performance linear algebra libraries. Furthermore, the experiments with several sparse matrices arising in representative real-world applications and synthetic dense test matrices reveal a performance advantage of the block Lanczos algorithm when targeting the same approximation accuracy.
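
To make the building blocks named above concrete, here is a minimal sketch of a randomized SVD based on subspace iteration. It is not the authors' GPU implementation: NumPy on the CPU stands in for the GPU kernels, the function name `rand_svd` is hypothetical, and the parameters `r` (subspace width) and `p` (number of iterations) are this sketch's own names, which are only assumed to correspond to the $r$ and $p$ used in the figure captions. Each step maps to one of the three kernel types: a product with $A$ or $A^T$, an orthogonalization (here a QR factorization), and a small dense SVD.

```python
import numpy as np

def rand_svd(A, r, p=2, seed=0):
    """Rank-r truncated SVD sketch via randomized subspace iteration.

    Returns U (m x r), s (r,), V (n x r) with A ~= U @ diag(s) @ V.T.
    Assumption: p >= 1 passes over A and A^T; not the paper's exact algorithm.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Random start block with orthonormal columns (n x r).
    Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
    for _ in range(p):
        # Building block 1: multiply by A, then orthogonalize (m x r).
        Y, _ = np.linalg.qr(A @ Q)
        # Building block 2: multiply by A^T, then orthogonalize (n x r).
        Q, R = np.linalg.qr(A.T @ Y)
    # Since A^T Y = Q R, the projection Y Y^T A equals Y R^T Q^T.
    # Building block 3: small r x r SVD of R^T.
    Ub, s, Vbt = np.linalg.svd(R.T)
    return Y @ Ub, s, Q @ Vbt.T
```

For a sparse matrix, `A @ Q` and `A.T @ Y` are the SpMM and transposed-SpMM products whose relative cost the paper discusses; the sketch performs one of each per iteration.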

Paper Structure

This paper contains 23 sections, 16 equations, 4 figures, 2 tables, and 5 algorithms.

Figures (4)

  • Figure 1: Relative residuals ${\cal R}_{1}$ (top) and ${\cal R}_{10}$ (bottom) for the solutions computed with RandSVD and LancSVD and different values of $r$ and $p$. In all cases, $b=16$.
  • Figure 2: Execution time of LancSVD and RandSVD (top and middle, respectively) and speed-up of LancSVD with respect to RandSVD (bottom).
  • Figure 3: Distribution of the flops across the major building blocks in LancSVD and RandSVD (top and bottom, respectively).
  • Figure 4: Relative residuals ${\cal R}_{1}$ to ${\cal R}_{10}$ for the solutions computed with LancSVD and RandSVD and different values of $r$ and $p$ (top) and execution time (bottom). In all cases, $b=16$.
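
The captions above refer to relative residuals ${\cal R}_{i}$ without defining them in this excerpt. As an illustration only, the sketch below assumes one common definition for an approximate singular triplet $(\sigma_i, u_i, v_i)$, namely $\|A^T u_i - \sigma_i v_i\|_2 / \sigma_i$; the paper's exact formula may differ, and the function name `relative_residuals` is hypothetical.

```python
import numpy as np

def relative_residuals(A, U, s, V):
    """Per-triplet relative residuals under an ASSUMED definition:
    R_i = ||A^T u_i - s_i v_i||_2 / s_i, with u_i, v_i the i-th columns of U, V.
    """
    # Column i of resid is A^T u_i - s_i v_i (broadcasting scales V's columns).
    resid = A.T @ U - V * s
    return np.linalg.norm(resid, axis=0) / s
```

Under this definition, an exact singular triplet gives a residual at the level of rounding error, so larger values indicate triplets that have not yet converged.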