Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent
Jihao Andreas Lin, Javier Antorán, Shreyas Padhy, David Janz, José Miguel Hernández-Lobato, Alexander Terenin
TL;DR
This paper addresses the computational bottleneck of Gaussian process (GP) posterior sampling by reframing GP conditioning as a stochastic optimization problem, so that the posterior mean and posterior samples can be computed cheaply with stochastic gradient descent (SGD). It introduces a stochastic objective for the posterior mean and a variance-reduced objective for posterior samples, and uses random Fourier features and an inducing-point extension to scale to large datasets. The theoretical analysis identifies an implicit bias: SGD converges quickly along the top directions of the kernel spectrum and slowly along low-eigenvalue directions, which leads to a description of the posterior in terms of three regions (prior, interpolation, and extrapolation). Empirically, SGD-based GP posteriors achieve competitive or state-of-the-art predictive performance on sufficiently large-scale or ill-conditioned regression tasks, and their uncertainty estimates match significantly more expensive baselines in large-scale parallel Thompson sampling.
Abstract
Gaussian processes are a powerful framework for quantifying uncertainty and for sequential decision-making but are limited by the requirement of solving linear systems. In general, this has a cubic cost in dataset size and is sensitive to conditioning. We explore stochastic gradient algorithms as a computationally efficient method of approximately solving these linear systems: we develop low-variance optimization objectives for sampling from the posterior and extend these to inducing points. Counterintuitively, stochastic gradient descent often produces accurate predictions, even in cases where it does not converge quickly to the optimum. We explain this through a spectral characterization of the implicit bias from non-convergence. We show that stochastic gradient descent produces predictive distributions close to the true posterior both in regions with sufficient data coverage, and in regions sufficiently far away from the data. Experimentally, stochastic gradient descent achieves state-of-the-art performance on sufficiently large-scale or ill-conditioned regression tasks. Its uncertainty estimates match the performance of significantly more expensive baselines on a large-scale Bayesian optimization task.
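The idea of replacing the linear solve with stochastic optimization can be illustrated with a minimal NumPy sketch for the posterior mean only. It is not the authors' implementation. It assumes a squared-exponential kernel, a toy 1-D dataset, and a quadratic objective over representer weights v whose minimizer is (K + noise_var * I)^{-1} y. The names `rbf_kernel` and `sgd_posterior_mean`, the step-size rule, and all hyperparameters are illustrative choices. The data-fit term is subsampled with minibatches, while the regularizer gradient is computed exactly here rather than estimated with random features, and there is no momentum or iterate averaging.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def sgd_posterior_mean(K, y, noise_var, batch_size=32, steps=5000, seed=0):
    """Minimise 0.5*||y - K v||^2 + 0.5*noise_var * v^T K v with minibatch SGD.

    The exact minimiser is v* = (K + noise_var * I)^{-1} y, so K_test @ v
    approximates the GP posterior mean without a direct linear solve.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    # Conservative step size (an assumption of this sketch): for a kernel
    # with non-negative entries, the largest row sum bounds the top eigenvalue.
    lam = K.sum(axis=1).max()
    step = 0.5 / (lam * (lam + noise_var))
    v = np.zeros(n)
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        K_b = K[idx]                                # rows of K for the minibatch
        resid = K_b @ v - y[idx]                    # minibatch residuals
        grad_fit = (n / batch_size) * (K_b.T @ resid)  # unbiased data-fit gradient
        grad_reg = noise_var * (K @ v)              # exact regulariser gradient
        v -= step * (grad_fit + grad_reg)
    return v


# Toy regression problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
n = 200
noise_var = 0.1
x = np.sort(rng.uniform(-3.0, 3.0, n))
y = np.sin(x) + np.sqrt(noise_var) * rng.standard_normal(n)

K = rbf_kernel(x, x)
v_sgd = sgd_posterior_mean(K, y, noise_var)
v_exact = np.linalg.solve(K + noise_var * np.eye(n), y)  # cubic-cost baseline

mean_sgd = K @ v_sgd
mean_exact = K @ v_exact
rmse_sgd = np.sqrt(np.mean((mean_sgd - mean_exact) ** 2))  # SGD vs exact mean
rmse_zero = np.sqrt(np.mean(mean_exact ** 2))              # error of the zero start
print(f"prediction RMSE vs exact mean: {rmse_sgd:.4f} (zero init: {rmse_zero:.4f})")
```

Comparing `v_sgd` with `v_exact` directly, alongside the prediction-space comparison above, is one way to probe the abstract's claim that predictions can be accurate even when the weights themselves have not converged.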
