Inverse-Free Sparse Variational Gaussian Processes

Stefano Cortinovis, Laurence Aitchison, Stefanos Eleftheriadis, Mark van der Wilk

Abstract

Gaussian processes (GPs) offer appealing properties but are costly to train at scale. Sparse variational GP (SVGP) approximations reduce cost yet still rely on Cholesky decompositions of kernel matrices, ill-suited to low-precision, massively parallel hardware. While one can construct valid variational bounds that rely only on matrix multiplications (matmuls) via an auxiliary matrix parameter, optimising them with off-the-shelf first-order methods is challenging. We make the inverse-free approach practical by proposing a better-conditioned bound and deriving a matmul-only natural-gradient update for the auxiliary parameter, markedly improving stability and convergence. We further provide simple heuristics, such as step-size schedules and stopping criteria, that make the overall optimisation routine fit seamlessly into existing workflows. Across regression and classification benchmarks, we demonstrate that our method 1) serves as a drop-in replacement in SVGP-based models (e.g., deep GPs), 2) recovers similar performance to traditional methods, and 3) can be faster than baselines when well tuned.
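One common way such matmul-only bounds arise is by lower-bounding terms involving $\mathbf{K}^{-1}$ with expressions in an auxiliary matrix $\mathbf{T}$ that require only matrix products. The NumPy sketch below illustrates this generic construction, with $\mathbf{T} = \mathbf{C}\mathbf{C}^\top$ parameterised by a triangular factor $\mathbf{C}$; it is an assumption-level illustration of the idea, not the paper's exact bound.

```python
# Minimal sketch of the generic "inverse-free" bounding trick: quantities
# involving K^{-1} are lower-bounded via an auxiliary matrix T = C C^T,
# so evaluating the bound needs only matmuls and elementwise ops
# (no Cholesky factorisation or solve of K). Illustrative only; not the
# exact bound used in the paper.
import numpy as np

def inverse_free_terms(K, C, x):
    """Matmul-only lower bounds on -log det(K) and x^T K^{-1} x.

    K : (M, M) symmetric positive-definite kernel matrix
    C : (M, M) lower-triangular factor (positive diagonal) of T = C C^T
    x : (M,) vector
    """
    T = C @ C.T
    # Concavity of log det gives  -log det(K) >= M + log det(T) - tr(T K),
    # with equality at T = K^{-1}.  log det(T) is free given the factor C.
    logdet_T = 2.0 * np.sum(np.log(np.diag(C)))
    logdet_bound = K.shape[0] + logdet_T - np.trace(T @ K)
    # The PSD identity (I - K T)^T K^{-1} (I - K T) >= 0 gives, for symmetric T,
    #   x^T K^{-1} x >= 2 x^T T x - (T x)^T K (T x),  again tight at T = K^{-1}.
    Tx = T @ x
    quad_bound = 2.0 * x @ Tx - Tx @ K @ Tx
    return logdet_bound, quad_bound
```

In an inverse-free SVGP-style objective, the factor `C` (or an analogous triangular parameter) is treated as an extra variational parameter and optimised jointly with the rest of the model, which is what makes the conditioning and optimisation of this parameter the central practical difficulty addressed above.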

Paper Structure

This paper contains 43 sections, 3 theorems, 65 equations, 3 figures, 2 tables.

Key Result

Proposition 1

Let $\ell(\mathbf{L})$ be as in eq:natgrad_loss. Then [equation omitted], where $\mathop{\mathrm{tril}}\nolimits(\cdot)$ and $\mathop{\mathrm{diag}}\nolimits(\cdot)$ return the lower triangular part and the diagonal of a matrix, respectively.
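The proposition states a matmul-only expression for the natural-gradient update of the triangular factor $\mathbf{L}$. As a purely illustrative sketch of how $\mathop{\mathrm{tril}}$ and $\mathop{\mathrm{diag}}$ enter such updates (the exact expression of Proposition 1 is not reproduced here), the snippet below shows the Euclidean gradient with respect to a lower-triangular factor of $\mathbf{S} = \mathbf{L}\mathbf{L}^\top$ and the "half-diagonal" projection operator commonly used in Cholesky-factor calculus; both need only matmuls and elementwise masking.

```python
# Illustrative only: generic ingredients showing how tril(.) and diag(.)
# appear in matmul-only gradient computations for a lower-triangular factor
# L of S = L L^T. This is NOT the expression from Proposition 1.
import numpy as np

def grad_wrt_factor(G, L):
    """Gradient of l(S) w.r.t. the free entries of L, where S = L L^T.

    G : (M, M) symmetric gradient dl/dS
    L : (M, M) lower-triangular factor
    If dl = tr(G dS), the chain rule gives dl/dL = 2 * tril(G @ L).
    """
    return 2.0 * np.tril(G @ L)

def half_diag_tril(X):
    """Lower-triangular part of X with its diagonal halved.

    This projection, Phi(X) = tril(X) - 0.5 * diag(diag(X)), is the standard
    operator appearing in Cholesky-factor calculus and in natural-gradient-
    style updates of triangular factors.
    """
    return np.tril(X) - 0.5 * np.diag(np.diag(X))
```

Both operations reduce to matrix products plus elementwise masking, which is what makes updates of this form well suited to the low-precision, massively parallel hardware targeted by the paper.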

Figures (3)

  • Figure 1: Loss traces on snelson and banana datasets. The N and P suffixes refer to the use of NG updates (sec:natgrad) and inducing mean preconditioning (sec:preconditioning), respectively. R-SVGP (NP) is the only inverse-free variant that matches the performance of W-SVGP and L-SVGP (P), which use Cholesky decompositions.
  • Figure 2: NLPD/runtime on elevators and kin40k datasets for different choices of $M$ and batch size $B = 100$. Lines show the mean over 5 seeds; shaded regions indicate $\pm 1$ standard error (often smaller than line width).
  • Figure S1: NLPD/runtime on elevators and kin40k datasets for different choices of the number of probes $K$, with $M = 2000$ and $B = 100$. Lines show the mean over 5 seeds; shaded regions indicate $\pm 1$ standard error.

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • Proposition 3