Recommendations for Baselines and Benchmarking Approximate Gaussian Processes

Sebastian W. Ober, Artem Artemev, Marcel Wagenländer, Rudolfs Grobins, Mark van der Wilk

TL;DR

This work tackles the challenge of evaluating approximate Gaussian process (GP) methods when hyperparameters require tuning. It introduces a standardized benchmarking framework centered on preserving automatic hyperparameter selection and uncertainty quantification, anchored by a robust SGPR baseline that can approach near-exact performance as compute increases, and a Pareto-front style analysis over compute budgets. The authors provide practical procedures to train SGPR automatically, include numerical-stability fixes, and propose metrics and protocols (including ELBO bounds) to assess fidelity to the exact GP, comparing against stochastic variational GP methods. The proposed protocol clarifies method strengths and gaps, enabling practitioners to choose suitable baselines and guiding researchers toward open problems with reproducible, fair benchmarks.
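As a rough illustration of what a Pareto-front style comparison over compute budgets could look like, the sketch below filters a set of (compute cost, error) pairs down to the non-dominated ones, with lower being better on both axes. The function name `pareto_front` and the choice of a single scalar cost and a single scalar error per run are assumptions made for this example; the source does not specify how the front is constructed.

```python
def pareto_front(points):
    """Return the (cost, error) pairs that no other pair beats on both axes.

    Lower is better for both cost and error. `points` is any iterable of
    (cost, error) tuples, e.g. one tuple per trained model.
    """
    front = []
    best_error = float("inf")
    # Sorting by cost (then error) means a point is non-dominated exactly
    # when it improves on the best error seen at any lower-or-equal cost.
    for cost, error in sorted(points):
        if error < best_error:
            front.append((cost, error))
            best_error = error
    return front
```

In a benchmarking setting, the input pairs might be (training time, NLPD) for one method at several inducing-point counts; a second method is then competitive at a given budget only if its points are not dominated by this front.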

Abstract

Gaussian processes (GPs) are a mature and widely-used component of the ML toolbox. One of their desirable qualities is automatic hyperparameter selection, which allows for training without user intervention. However, in many realistic settings, approximations are needed, and these typically do require tuning. We argue that this requirement for tuning complicates evaluation, which has led to a lack of clear recommendations on which method should be used in which situation. To address this, we make recommendations for comparing GP approximations based on a specification of what a user should expect from a method. In addition, we develop a training procedure for the variational method of Titsias [2009] that leaves no choices to the user, and show that this is a strong baseline that meets our specification. We conclude that benchmarking according to our suggestions gives a clearer view of the current state of the field, and uncovers problems that are still open that future papers should address.
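To make the role of the variational method of Titsias [2009] more concrete, the following is a minimal NumPy sketch of its collapsed lower bound on the log marginal likelihood, alongside the exact value and an upper bound. It is not the authors' training procedure. The assumptions are: a squared exponential kernel, fixed (not optimised) hyperparameters, evenly spaced inducing inputs, and synthetic 1D data. The N x N matrix Qff is formed explicitly for readability, so the sketch does not have the O(NM^2) cost of a practical implementation. The upper bound used here keeps the log-determinant of Qff plus noise and inflates the noise by tr(Kff - Qff) in the quadratic term; whether this matches the upper bound used in the paper is an assumption. All function names are illustrative.

```python
import numpy as np


def se_kernel(a, b, lengthscale, variance):
    """Squared exponential kernel matrix between 1D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)


def gaussian_logpdf(y, cov):
    """log N(y | 0, cov), computed through a Cholesky factor of cov."""
    chol = np.linalg.cholesky(cov)
    whitened = np.linalg.solve(chol, y)
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (len(y) * np.log(2.0 * np.pi) + log_det + whitened @ whitened)


def exact_lml(x, y, lengthscale, variance, noise_var):
    """Exact GP log marginal likelihood (cubic cost in the number of points)."""
    kff = se_kernel(x, x, lengthscale, variance)
    return gaussian_logpdf(y, kff + noise_var * np.eye(len(x)))


def sgpr_bounds(x, y, z, lengthscale, variance, noise_var, jitter=1e-6):
    """Lower and upper bounds on the exact LML, given inducing inputs z."""
    n = len(x)
    kuu = se_kernel(z, z, lengthscale, variance) + jitter * np.eye(len(z))
    kuf = se_kernel(z, x, lengthscale, variance)
    # Nystrom approximation Qff = Kfu Kuu^{-1} Kuf, formed as A^T A
    # with A = L^{-1} Kuf and L the Cholesky factor of Kuu.
    a = np.linalg.solve(np.linalg.cholesky(kuu), kuf)
    qff = a.T @ a
    # tr(Kff - Qff); the SE kernel has constant diagonal `variance`.
    trace_gap = max(n * variance - np.trace(qff), 0.0)
    # Collapsed lower bound: Gaussian term with Qff minus a trace penalty.
    lower = gaussian_logpdf(y, qff + noise_var * np.eye(n))
    lower -= 0.5 * trace_gap / noise_var
    # Upper bound: same log-determinant, noise inflated by trace_gap
    # in the quadratic term only.
    _, log_det = np.linalg.slogdet(qff + noise_var * np.eye(n))
    quad = y @ np.linalg.solve(qff + (noise_var + trace_gap) * np.eye(n), y)
    upper = -0.5 * (n * np.log(2.0 * np.pi) + log_det + quad)
    return lower, upper


# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 50)
y = np.sin(x) + 0.3 * rng.standard_normal(50)

exact = exact_lml(x, y, lengthscale=1.0, variance=1.0, noise_var=0.09)
for m in (3, 8, 20):
    z = np.linspace(0.0, 6.0, m)
    lower, upper = sgpr_bounds(x, y, z, 1.0, 1.0, 0.09)
    print(f"M={m:2d}  lower={lower:9.3f}  exact={exact:9.3f}  upper={upper:9.3f}")
```

With hyperparameters held fixed, the two bounds should bracket the exact value and close in on it as the number of inducing points grows. In the setting described in the abstract, the hyperparameters would additionally be optimised against the lower bound, which this sketch leaves out.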

Paper Structure

This paper contains 28 sections, 21 equations, 16 figures, 3 tables, 2 algorithms.

Figures (16)

  • Figure 1: SGPR with a squared exponential kernel (highest true marginal likelihood) on the toy 1D Snelson dataset. Left: Example approximate solution. Middle: Upper and lower bounds on marginal likelihood with varying $M$. Note that different hyperparameters are found as $M$ increases, which allows the upper bound to rise, before it eventually converges as the hyperparameters converge. Right: Hyperparameters with varying $M$.
  • Figure 2: SGPR with a Matérn-$\frac{1}{2}$ kernel (highest true marginal likelihood of stationary kernels) on a step dataset. Left: Example approximate solution. Middle: Upper and lower bounds on marginal likelihood with varying $M$. Note that different hyperparameters are found as $M$ increases. The upper and lower bounds do not converge. Right: Hyperparameters with varying $M$, which do not converge even when $M \approx N$.
  • Figure 3: Illustration of our proposed benchmarking procedure on the Keggundirected dataset, where SGPR yields a near-exact approximation. We plot the results from five independent runs for each method, plotting the negative ELBO (nLB), negative upper bound (nUB), negative approximate LML (for IterGP), and RMSEs and NLPDs for both methods. We also provide training set size $n$ and input dimension $d$ for reference. Lower is better for all metrics.
  • Figure 4: Our proposed benchmarking procedure on Kin40k, where SGPR does not give a near-exact approximation.
  • Figure C.1: We plot training curves for SVGP with 1000 inducing points on keggundirected with various hyperparameter settings, changing from the optimal hyperparameter setting found using the described grid search. We also plot the SGPR mean from our proposed procedure (extracting the values for $M=1000$). Top left: Dependence on minibatch size. Top right: Dependence on learning rate. Bottom left: Dependence on optimiser momentum parameters $(\beta_1, \beta_2)$. Bottom right: Dependence on use of a scheduler.
  • ...and 11 more figures
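
The captions above report RMSE and NLPD as predictive metrics. A minimal sketch of how these two quantities are commonly computed follows, assuming an independent Gaussian predictive distribution per test point and assuming that the supplied predictive variance already includes the observation noise. The function names are illustrative and are not taken from the paper's code.

```python
import math


def rmse(y_true, pred_mean):
    """Root mean squared error of the predictive means."""
    sq_errors = [(y - m) ** 2 for y, m in zip(y_true, pred_mean)]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


def nlpd(y_true, pred_mean, pred_var):
    """Mean negative log predictive density under per-point Gaussians."""
    terms = [
        0.5 * math.log(2.0 * math.pi * v) + 0.5 * (y - m) ** 2 / v
        for y, m, v in zip(y_true, pred_mean, pred_var)
    ]
    return sum(terms) / len(terms)
```

RMSE depends only on the predictive means, whereas NLPD also penalises predictive variances that are too small or too large for the observed errors, which is why the two are reported side by side when uncertainty quantification matters.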