Fast Truncated SVD of Sparse and Dense Matrices on Graphics Processors
Andres E. Tomas, Enrique S. Quintana-Orti, Hartwig Anzt
TL;DR
The paper addresses the efficient computation of low-rank matrix approximations via the truncated SVD on GPUs by developing GPU-optimized implementations of the randomized SVD (RandSVD) and a blocked Lanczos-based SVD (LancSVD). It shows that both methods can be expressed through common linear algebra building blocks that map to GPU kernels (matrix multiplications, orthogonalizations, and small SVDs), and it provides a detailed cost and performance analysis for sparse and dense matrices. Empirical results on real SuiteSparse matrices and synthetic dense matrices show that LancSVD typically outperforms RandSVD at the same target accuracy, mainly because it converges faster and because the transposed sparse-times-dense matrix product (SpMM) is expensive on GPUs. The findings suggest that GPU-accelerated building blocks, particularly for SpMM and orthogonalization, make the truncated SVD practical and scalable for large sparse and dense problems. Future work points to improved restart strategies that further reduce the total number of multiplications.
Abstract
We investigate the solution of low-rank matrix approximation problems using the truncated SVD. For this purpose, we develop and optimize GPU implementations for the randomized SVD and a blocked variant of the Lanczos approach. Our work takes advantage of the fact that the two methods are composed of very similar linear algebra building blocks, which can be assembled using numerical kernels from existing high-performance linear algebra libraries. Furthermore, the experiments with several sparse matrices arising in representative real-world applications and synthetic dense test matrices reveal a performance advantage of the block Lanczos algorithm when targeting the same approximation accuracy.
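To make the "building blocks" concrete, the following is a minimal CPU sketch of a randomized truncated SVD built from only three operations: products with A and its transpose, orthogonalization (here a QR factorization), and one small dense SVD. It is an illustration under stated assumptions, not the authors' GPU implementation: the function name `rand_svd`, the parameters `r` (target rank) and `p` (number of iterations), and the use of NumPy's Householder QR for the orthogonalization step are all choices made for this example.

```python
import numpy as np

def rand_svd(A, r, p=2, seed=0):
    """Rank-r truncated SVD sketch via randomized subspace iteration.

    Hypothetical helper for illustration; requires p >= 1.
    Returns U (m x r), s (r,), Vt (r x n) with A ~= U @ diag(s) @ Vt.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    rng = np.random.default_rng(seed)
    m, n = A.shape

    # Building block 0: random orthonormal start block (n x r).
    Q, _ = np.linalg.qr(rng.standard_normal((n, r)))

    for _ in range(p):
        # Building block 1: multiply by A, then orthogonalize (m x r).
        P, _ = np.linalg.qr(A @ Q)
        # Building block 2: multiply by A^T, then orthogonalize (n x r).
        # The triangular factor R of the last pass is kept.
        Q, R = np.linalg.qr(A.T @ P)

    # Since A^T P = Q R, the projection P P^T A equals P R^T Q^T.
    # Building block 3: small r x r SVD of R^T.
    Ub, s, Vbt = np.linalg.svd(R.T)

    U = P @ Ub        # left singular vector estimates (m x r)
    Vt = Vbt @ Q.T    # right singular vector estimates (r x n)
    return U, s, Vt

# Usage: an exactly rank-5 matrix is recovered up to rounding error.
rng = np.random.default_rng(1)
A = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))
U, s, Vt = rand_svd(A, r=5)
rel_err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
```

A block Lanczos method draws on the same kinds of operations but accumulates the orthogonalized blocks into a growing Krylov basis instead of overwriting them, which is one reason the two approaches can share kernels. On a GPU, the products `A @ Q` and `A.T @ P` would correspond to the SpMM or GEMM calls that the abstract refers to.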
