Efficient GPU implementation of randomized SVD and its applications

Łukasz Struski, Paweł Morkisz, Przemysław Spurek, Samuel Rodriguez Bernabeu, Tomasz Trzciński

TL;DR

This work reformulates the randomized decomposition problem to use fast matrix multiplication operations (BLAS-3) as building blocks, and shows that this formulation, combined with fast random number generators, makes it possible to fully exploit the parallel processing available on GPUs.

Abstract

Matrix decompositions are ubiquitous in machine learning, with applications in dimensionality reduction, data compression, and deep learning algorithms. Typical solutions for matrix decompositions have polynomial complexity, which significantly increases their computational cost and time. In this work, we leverage efficient processing operations that can be run in parallel on modern Graphics Processing Units (GPUs), the predominant computing architecture used, e.g., in deep learning, to reduce the computational burden of computing matrix decompositions. More specifically, we reformulate the randomized decomposition problem to incorporate fast matrix multiplication operations (BLAS-3) as building blocks. We show that this formulation, combined with fast random number generators, makes it possible to fully exploit the parallel processing available on GPUs. Our extensive evaluation confirms the superiority of this approach over competing methods, and we release the results of this research as part of the official CUDA implementation (https://docs.nvidia.com/cuda/cusolver/index.html).
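To make the idea of building a randomized decomposition from matrix-matrix products concrete, here is a minimal NumPy sketch of the generic randomized SVD scheme (random projection, orthonormalization, small dense SVD). It is an illustration under stated assumptions, not the authors' cuSOLVER implementation: the function name `randomized_svd`, the parameters `oversample` and `power_iters`, and their default values are hypothetical choices made for this example, and it runs on the CPU.

```python
import numpy as np


def randomized_svd(A, k, oversample=10, power_iters=2, seed=0):
    """Approximate the top-k singular triplets of A.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Sample a few extra directions beyond k to improve accuracy.
    ell = min(k + oversample, m, n)

    # Random number generation: Gaussian test matrix of shape (n, ell).
    omega = rng.standard_normal((n, ell))

    # Matrix-matrix product (BLAS-3 GEMM): sample the range of A.
    Y = A @ omega
    Q, _ = np.linalg.qr(Y)

    # Optional power iterations, again dominated by GEMM calls,
    # with re-orthonormalization for numerical stability.
    for _ in range(power_iters):
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)

    # Project A onto the sampled subspace (GEMM) and take a small SVD.
    B = Q.T @ A  # shape (ell, n)
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)

    # Lift the left singular vectors back to the original space (GEMM).
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k, :]
```

Calling `randomized_svd(A, k=5)` on a matrix `A` returns `U`, `s`, `Vt` such that `(U * s) @ Vt` approximates the best rank-5 approximation of `A`. Almost all of the work sits in the `@` products, which is the property that makes this family of algorithms a natural fit for GPU BLAS-3 kernels.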

Paper Structure

This paper contains 9 sections, 5 equations, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: Mean and standard deviation, over 10 runs, of the run time of competing methods relative to the run time of our method (speed-up). We consider two types of methods: those that calculate the whole spectrum and those that calculate only the $k$ largest eigenvalues.
  • Figure 2: Speed-up of our method relative to the other methods in the 'fast decay' case. The lines show the mean over 10 runs and the light areas show the standard deviation. In this case we create a matrix $A$ of size $2000\times n$, where $n$ is the number of columns, and calculate the largest 1%, 3%, 5%, and 10% of its eigenvalues.
  • Figure 3: Speed-up of our method relative to the other methods in the 'sharp decay' case. The lines show the mean over 10 runs and the light areas show the standard deviation. In this case we create a matrix $A$ of size $2000\times n$, where $n$ is the number of columns, and calculate the largest 1%, 3%, 5%, and 10% of its eigenvalues.
  • Figure 4: Speed-up of our method relative to the other methods in the 'sharp decay' case. The lines show the mean over 10 runs and the light areas show the standard deviation. In this case we create a matrix $A$ of size $2000\times n$, where $n$ is the number of columns, and calculate the largest 1%, 3%, 5%, and 10% of its eigenvalues. The dashed black line marks the value 1 (the reference for our method).