Stochastic Gradient Descent for Gaussian Processes Done Right

Jihao Andreas Lin; Shreyas Padhy; Javier Antorán; Austin Tripp; Alexander Terenin; Csaba Szepesvári; José Miguel Hernández-Lobato; David Janz

TL;DR

The paper addresses the computational bottleneck of Gaussian process regression, namely computing $(K+\lambda I)^{-1}b$, that is, solving the linear system $(K+\lambda I)v = b$. It does so with stochastic dual descent (SDD), a stochastic gradient method applied to a dual objective. SDD combines the dual objective with random-coordinate gradient estimates whose noise is multiplicative, Nesterov momentum, and geometric iterate averaging to accelerate convergence for both posterior mean estimation and posterior sampling. Empirically, SDD matches or surpasses preconditioned conjugate gradients (CG) and variational GP methods on UCI regression benchmarks and Bayesian optimization, and performs competitively with state-of-the-art graph neural networks on a large-scale molecule-protein binding affinity prediction task. The results suggest that a carefully designed first-order stochastic method can make Gaussian processes competitive with modern deep learning approaches in large-scale, uncertainty-aware tasks, broadening their practical applicability.
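The four ingredients named above can be combined into a short loop. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the dual objective is taken to be $\frac{1}{2}\alpha^\top(K+\lambda I)\alpha - \alpha^\top b$, batches are drawn without replacement, and the function names, the toy data, and the conservative step-size rule based on the largest eigenvalue are all hypothetical choices made for illustration.

```python
import numpy as np


def rbf_kernel(X1, X2, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of X1 and X2."""
    sq = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


def stochastic_dual_descent(K, b, lam, beta, batch_size,
                            rho=0.9, r=0.01, num_steps=30_000, seed=0):
    """Approximately solve (K + lam * I) alpha = b.

    Assumed form (illustrative): minimise the dual objective
        0.5 * alpha^T (K + lam * I) alpha - alpha^T b
    with random-coordinate gradients, Nesterov momentum (rho) and
    geometric iterate averaging (r).
    """
    n = b.shape[0]
    rng = np.random.default_rng(seed)
    alpha = np.zeros(n)
    velocity = np.zeros(n)
    alpha_avg = np.zeros(n)
    for _ in range(num_steps):
        # Assumption: sample a batch of coordinates without replacement.
        idx = rng.choice(n, size=batch_size, replace=False)
        # Nesterov momentum evaluates the gradient at a look-ahead point.
        lookahead = alpha + rho * velocity
        # Whole dual gradient restricted to the sampled coordinates,
        # rescaled by n / batch_size so the estimate is unbiased.
        grad = np.zeros(n)
        grad[idx] = (n / batch_size) * (
            K[idx] @ lookahead + lam * lookahead[idx] - b[idx]
        )
        velocity = rho * velocity - beta * grad
        alpha = alpha + velocity
        # Geometric (exponentially weighted) iterate averaging.
        alpha_avg = r * alpha + (1.0 - r) * alpha_avg
    return alpha_avg


# Toy problem (hypothetical data, small enough to compare with a direct solve).
rng = np.random.default_rng(0)
n, lam, batch_size = 128, 1.0, 32
X = rng.normal(size=(n, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
K = rbf_kernel(X, X)

# Deliberately conservative step-size derived from the largest eigenvalue;
# this is an assumption of the sketch, not the tuning used in the paper.
largest_eig = np.linalg.eigvalsh(K)[-1] + lam
beta = 0.05 * batch_size / (n * largest_eig)

alpha_sdd = stochastic_dual_descent(K, y, lam, beta, batch_size)
alpha_exact = np.linalg.solve(K + lam * np.eye(n), y)
rel_err = np.linalg.norm(alpha_sdd - alpha_exact) / np.linalg.norm(alpha_exact)
print(f"relative error vs direct solve: {rel_err:.2e}")
```

Each step touches only `batch_size` rows of `K`, which is what makes the approach attractive when the full matrix is too large to factorise. The direct solve is included here only as a reference for the toy problem.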

Abstract

As is well known, both sampling from the posterior and computing the mean of the posterior in Gaussian process regression reduce to solving a large linear system of equations. We study the use of stochastic gradient descent for solving this linear system, and show that when done right (by which we mean using specific insights from the optimisation and kernel communities) stochastic gradient descent is highly effective. To that end, we introduce a particularly simple stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices through a series of ablation studies. Further experiments demonstrate that our new method is highly competitive. In particular, our evaluations on the UCI regression tasks and on Bayesian optimisation set our approach apart from preconditioned conjugate gradients and variational Gaussian process approximations. Moreover, our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
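To make the reduction to a linear system concrete, the sketch below computes a posterior mean and one posterior sample on a toy problem, each with a single solve against the same matrix $K+\lambda I$. It assumes a squared-exponential kernel, treats $\lambda$ as the observation noise variance, and draws the sample with the standard pathwise conditioning (Matheron's rule) identity; the data, variable names, and jitter value are hypothetical, and a direct solver stands in for any iterative method.

```python
import numpy as np


def rbf_kernel(X1, X2, lengthscale=1.0):
    """Squared-exponential kernel matrix between the rows of X1 and X2."""
    sq = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)


rng = np.random.default_rng(0)
n, m, lam = 50, 5, 0.1  # training points, test points, noise variance
X = rng.uniform(-3.0, 3.0, size=(n, 1))
y = np.sin(X[:, 0]) + np.sqrt(lam) * rng.normal(size=n)
X_test = np.linspace(-3.0, 3.0, m)[:, None]

K = rbf_kernel(X, X)
K_test = rbf_kernel(X_test, X)
A = K + lam * np.eye(n)

# Posterior mean: one solve with right-hand side b = y.
v_mean = np.linalg.solve(A, y)
post_mean = K_test @ v_mean

# Posterior sample via pathwise conditioning: draw a joint prior sample at
# the training and test inputs, then correct it with one more solve whose
# right-hand side is b = y - f(X) - eps, with eps ~ N(0, lam * I).
X_all = np.vstack([X, X_test])
K_all = rbf_kernel(X_all, X_all) + 1e-6 * np.eye(n + m)  # jitter for Cholesky
f_all = np.linalg.cholesky(K_all) @ rng.normal(size=n + m)
f_train, f_test = f_all[:n], f_all[n:]
eps = np.sqrt(lam) * rng.normal(size=n)
v_sample = np.linalg.solve(A, y - f_train - eps)
post_sample = f_test + K_test @ v_sample

print("posterior mean  :", np.round(post_mean, 3))
print("posterior sample:", np.round(post_sample, 3))
```

Only the right-hand side changes between the two solves, so any solver for $(K+\lambda I)v = b$ serves both tasks.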

Paper Structure (20 sections, 28 equations, 9 figures, 3 tables, 1 algorithm)

Introduction
Gaussian Process Regression
Stochastic Dual Descent for Regression and Sampling
Gradient Descent: Primal versus Dual Objectives
Randomised Gradients: Random Features versus Random Coordinates
Nesterov's Momentum and Polyak-Ruppert Iterate Averaging
Connections to the Literature
Experiments and Benchmarks
UCI Regression Baselines
Large-scale Thompson Sampling
Molecule-protein Binding Affinity Prediction for Drug Discovery
Conclusion
Convex Duality and Uniform Approximation Bounds
Effects of Varying Step-size and Batch-size
Additional Details on Experimental Setups and Results
...and 5 more sections

Figures (9)

Figure 1: Comparison of full-batch primal and dual gradient descent on the pol data set with varying step-sizes. Primal gradient descent becomes unstable and diverges for $\beta n$ greater than $0.1$. Dual gradient descent is stable with larger step-sizes, allowing for markedly faster convergence than the primal. For $\beta n = 0.1$, the dual method makes more progress in the $K$-norm, whereas the primal method makes more progress in the $K^2$-norm.
Figure 2: A comparison of dual stochastic gradient descent on the pol data set with either random Fourier features or random coordinates, using batch size $B=512$, momentum $\rho=0.9$ and averaging parameter $r=0.001$ (see the section on Nesterov's Momentum and Polyak-Ruppert Iterate Averaging for an explanation of the latter two). Random features converge with $\beta n=5\times 10^{-4}$ but perform poorly, and diverge with a higher step-size. Random coordinates are stable with $\beta n = 50$ and show much stronger performance on all metrics. We include a version of random coordinates where only the $K\alpha$ term is subsampled: this breaks the multiplicative noise property, and results in an estimate which is worse on both the $K$-norm and the $K^2$-norm metric.
Figure 3: Comparison of dual stochastic gradient descent on the pol data set with different acceleration methods, using batch size $B = 512$, a geometric averaging parameter $r = 0.001$, and step-sizes tuned individually for each method (AdaGrad $\beta n = 10$; RMSprop & Adam $\beta n = 0.05$; Nesterov's momentum $\beta n = 50$). Both Adam and Nesterov's momentum perform well on Test RMSE, but the latter performs better on the $K$ and $K^2$ norms.
Figure 4: Comparison of optimisation strategies for the random coordinate estimator of the dual objective on the pol data set, using momentum $\rho = 0.9$, averaging parameter $r = 0.001$, batch size $B=128$, and step-size $\beta n = 50$. Nesterov's momentum significantly improves convergence speed across all metrics. The dashed olive line, marked arithmetic averaging, shows the regular iterate up until 70k steps, at which point averaging commences and the averaged iterate is shown. Arithmetic iterate averaging slows down convergence in $K$-norm once enabled. Geometric iterate averaging, on the other hand, outperforms arithmetic averaging and unaveraged iterates throughout optimisation.
Figure 5: Results for the Thompson sampling task. Plots show mean and standard error of the maximum function values identified, across $5$ length scales and $10$ seeds, against both the number of observations acquired and the corresponding compute time on an A100 GPU. The compute time includes drawing posterior function samples and finding their maxima. All methods share an initial data set of 50k points, and take 30 steps of parallel Thompson sampling, acquiring $1$k points at each.
...and 4 more figures
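The behaviour described in the captions of Figures 1 and 2 can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes the primal objective $\frac{1}{2}\|y - K\alpha\|^2 + \frac{\lambda}{2}\alpha^\top K\alpha$ and the dual objective $\frac{1}{2}\alpha^\top(K+\lambda I)\alpha - \alpha^\top y$ in their standard kernel ridge regression forms, and all function names and data are hypothetical. It checks that both gradients vanish at the same solution, that the dual tolerates a larger full-batch step-size, and that random-coordinate noise vanishes at the optimum only when the whole gradient is subsampled.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, batch_size = 64, 1.0, 16
X = rng.normal(size=(n, 2))
y = rng.normal(size=n)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)  # squared-exponential kernel, unit length scale
A = K + lam * np.eye(n)
alpha_star = np.linalg.solve(A, y)


def primal_grad(alpha):
    """Gradient of the assumed primal objective; its Hessian is K (K + lam I)."""
    return K @ (lam * alpha + K @ alpha - y)


def dual_grad(alpha):
    """Gradient of the assumed dual objective; its Hessian is K + lam I."""
    return A @ alpha - y


# Gradient descent on a quadratic is stable for steps below 2 / lambda_max(Hessian).
eigs = np.linalg.eigvalsh(K)
max_step_primal = 2.0 / (eigs[-1] * (eigs[-1] + lam))
max_step_dual = 2.0 / (eigs[-1] + lam)


def coord_grad(alpha, idx):
    """Random-coordinate estimate: the whole dual gradient restricted to idx."""
    g = np.zeros(n)
    g[idx] = (n / batch_size) * (A[idx] @ alpha - y[idx])
    return g


def coord_grad_k_only(alpha, idx):
    """Variant that subsamples only the K @ alpha term of the dual gradient."""
    g = lam * alpha - y
    g[idx] += (n / batch_size) * (K[idx] @ alpha)
    return g


# Averaging over a partition of the coordinates recovers the full gradient,
# so both estimators are unbiased under uniform sampling of the batches.
batches = rng.permutation(n).reshape(-1, batch_size)
alpha = rng.normal(size=n)
mean_estimate = np.mean([coord_grad(alpha, idx) for idx in batches], axis=0)
mean_estimate_k_only = np.mean(
    [coord_grad_k_only(alpha, idx) for idx in batches], axis=0
)

# Multiplicative noise: the estimate itself is zero at the optimum.
noise_at_opt = max(np.linalg.norm(coord_grad(alpha_star, idx)) for idx in batches)
noise_at_opt_k_only = max(
    np.linalg.norm(coord_grad_k_only(alpha_star, idx)) for idx in batches
)

print(f"max stable step, primal: {max_step_primal:.4f}, dual: {max_step_dual:.4f}")
print(f"estimate norm at optimum, full: {noise_at_opt:.2e}, "
      f"K-only: {noise_at_opt_k_only:.2e}")
```

Under these assumptions the ratio of the two step-size limits equals the largest eigenvalue of $K$, which grows with the data set size. The variant that subsamples only $K\alpha$ remains unbiased but keeps injecting noise at the solution.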

Theorems & Definitions (4)

Claim 1: Strong duality
proof
Claim 2: Uniform approximation
proof
