GPU-Resident Gaussian Process Regression Leveraging Asynchronous Tasks with HPX

Henrik Möllmann, Dirk Pflüger, Alexander Strack

TL;DR

This work extends the GPRat library by incorporating a fully GPU-resident GP prediction pipeline, and implements tiled algorithms for the GP prediction using optimized CUDA libraries, thereby exploiting massive parallelism for linear algebra operations.

Abstract

Gaussian processes (GPs) are a widely used regression tool, but the cubic complexity of exact solvers limits their scalability. To address this challenge, we extend the GPRat library by incorporating a fully GPU-resident GP prediction pipeline. GPRat is an HPX-based library that combines task-based parallelism with an intuitive Python API. We implement tiled algorithms for the GP prediction using optimized CUDA libraries, thereby exploiting massive parallelism for linear algebra operations. We evaluate the optimal number of CUDA streams and compare the performance of our GPU implementation to the existing CPU-based implementation. Our results show that the GPU implementation provides speedups for datasets larger than 128 training samples. We observe speedups of up to 4.3 for the Cholesky decomposition itself and up to 4.6 for the GP prediction. Furthermore, combining HPX with multiple CUDA streams allows GPRat to match and, for large datasets, surpass cuSOLVER's performance by up to 11 percent.
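To make the cubic cost concrete, the following is a minimal NumPy sketch of exact GP prediction via a Cholesky factorization. It is not GPRat's API: the function names `sq_exp_kernel` and `gp_predict`, the squared-exponential kernel, and the hyperparameter defaults (`lengthscale`, `variance`, `noise`) are assumptions chosen for illustration only.

```python
import numpy as np


def sq_exp_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between rows of a (n, d) and b (m, d).

    The kernel choice and its hyperparameters are illustrative assumptions.
    """
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def gp_predict(x_train, y_train, x_test, noise=0.1):
    """Posterior mean and variance of the latent function at x_test."""
    n = x_train.shape[0]
    K = sq_exp_kernel(x_train, x_train) + noise * np.eye(n)

    # The Cholesky factorization K = L L^T is the O(n^3) step.
    L = np.linalg.cholesky(K)

    # Solve K alpha = y with two solves against L and L^T.
    # np.linalg.solve is a general solver; a dedicated triangular
    # solver would be used in a performance-oriented implementation.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))

    K_cross = sq_exp_kernel(x_test, x_train)
    mean = K_cross @ alpha

    # Predictive variance: diag(K_test) - diag(K_cross K^{-1} K_cross^T).
    v = np.linalg.solve(L, K_cross.T)
    var = np.diag(sq_exp_kernel(x_test, x_test)) - np.sum(v**2, axis=0)
    return mean, var
```

Once the factor `L` is available, prediction only needs solves and matrix products against it, which is why the Cholesky decomposition dominates the cost for large training sets.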

Paper Structure (14 sections, 4 equations, 7 figures)

Figures (7)

  • Figure 1: Tiled Cholesky decomposition of $\mathbf K$.
  • Figure 2: Example for matrix $\mathbf K$ split into $5 \times 5$ tiles, colored by tasks in the first iteration ($J=0$): POTRF, TRSM, SYRK, and GEMM.
  • Figure 3: Runtime of Cholesky on GPU for a problem size of $n=32768$ with varying CUDA streams and tiles.
  • Figure 4: Breakdown of Cholesky runtime steps on GPU for a problem size of $n=32768$ and $32$ streams with varying tiles.
  • Figure 5: Cholesky in NVIDIA Visual Profiler for $n = 4096$ training samples, four tiles, and four streams.
  • ...and 2 more figures
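
The tiled Cholesky decomposition named in the captions of Figures 1 and 2 can be sketched on the CPU as follows. This is a sequential NumPy illustration of the standard right-looking tiled algorithm, not the paper's CUDA/HPX implementation. The function name `tiled_cholesky` and the helper `tile` are hypothetical, and the comments only mark which BLAS/LAPACK routine each step corresponds to.

```python
import numpy as np


def tiled_cholesky(K, n_tiles):
    """Lower Cholesky factor of SPD matrix K, computed tile by tile.

    Sequential sketch; assumes K.shape[0] is divisible by n_tiles.
    """
    n = K.shape[0]
    assert n % n_tiles == 0, "matrix size must be divisible by n_tiles"
    t = n // n_tiles
    A = K.astype(float)  # astype copies, so K itself is not modified

    def tile(i, j):
        # Returns a view into A, so writes update A in place.
        return A[i * t:(i + 1) * t, j * t:(j + 1) * t]

    for j in range(n_tiles):
        # POTRF: factorize the diagonal tile.
        tile(j, j)[:] = np.linalg.cholesky(tile(j, j))
        L_jj = tile(j, j)

        # TRSM: A_ij <- A_ij * L_jj^{-T} for all tiles below the diagonal.
        for i in range(j + 1, n_tiles):
            tile(i, j)[:] = np.linalg.solve(L_jj, tile(i, j).T).T

        # Trailing-matrix update.
        for i in range(j + 1, n_tiles):
            # SYRK: symmetric rank-t update of the diagonal tile.
            tile(i, i)[:] -= tile(i, j) @ tile(i, j).T
            # GEMM: update the off-diagonal tiles in row i.
            for k in range(j + 1, i):
                tile(i, k)[:] -= tile(i, j) @ tile(k, j).T

    # Tiles above the diagonal were never touched; zero them out.
    return np.tril(A)
```

Within one iteration, the TRSM calls are independent of each other, as are the SYRK and GEMM updates once their input tiles are ready. That independence is what a task-based runtime can exploit by scheduling tile operations concurrently.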