Improving Linear System Solvers for Hyperparameter Optimisation in Iterative Gaussian Processes

Jihao Andreas Lin; Shreyas Padhy; Bruno Mlodozeniec; Javier Antorán; José Miguel Hernández-Lobato

TL;DR

This work addresses the scalability of Gaussian process hyperparameter optimisation on large datasets using iterative methods built on linear system solvers. It combines three techniques: a pathwise gradient estimator, warm starting of linear system solvers, and early stopping under a limited compute budget. The pathwise estimator reduces the number of solver iterations and yields posterior samples via pathwise conditioning without extra solves; combined with warm starts, it delivers speed-ups of up to $72\times$ with negligible bias in practice. The methods are validated across several UCI datasets and solver types, with empirical evidence that the computational savings do not come at the expense of predictive performance, and they are supported by theoretical justification and publicly available code. Together, these contributions make GP hyperparameter optimisation more practical in large-scale settings.

Abstract

Scaling hyperparameter optimisation to very large datasets remains an open problem in the Gaussian process community. This paper focuses on iterative methods, which use linear system solvers, like conjugate gradients, alternating projections or stochastic gradient descent, to construct an estimate of the marginal likelihood gradient. We discuss three key improvements which are applicable across solvers: (i) a pathwise gradient estimator, which reduces the required number of solver iterations and amortises the computational cost of making predictions, (ii) warm starting linear system solvers with the solution from the previous step, which leads to faster solver convergence at the cost of negligible bias, (iii) early stopping linear system solvers after a limited computational budget, which synergises with warm starting, allowing solver progress to accumulate over multiple marginal likelihood steps. These techniques provide speed-ups of up to $72\times$ when solving to tolerance, and decrease the average residual norm by up to $7\times$ when stopping early.
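The warm-starting idea in (ii) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the toy 1-D dataset, the `rbf_kernel` helper, the fixed noise variance, and the hand-picked length-scale trajectory are all assumptions made for illustration. Each outer step solves $\mathbf{H}_{\bm{\theta}} \mathbf{v} = \mathbf{y}$ with conjugate gradients, once from zero (cold) and once from the previous step's solution (warm).

```python
import numpy as np

def rbf_kernel(x, lengthscale):
    # Squared-exponential kernel matrix on 1-D inputs (illustrative choice).
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def conjugate_gradients(H, b, v0, tol=1e-6, max_iters=1000):
    # Solve H v = b starting from v0; return the solution and iteration count.
    v = v0.copy()
    r = b - H @ v
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    for it in range(max_iters):
        if np.sqrt(rs) <= tol * b_norm:  # relative residual tolerance
            return v, it
        Hp = H @ p
        alpha = rs / (p @ Hp)
        v = v + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return v, max_iters

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=200))
y = np.sin(2 * x) + 0.1 * rng.standard_normal(200)
noise = 0.1  # assumed fixed noise variance

# Hypothetical hyperparameter trajectory: small length-scale changes per outer step.
lengthscales = [1.0, 0.98, 0.96, 0.94, 0.92]

cold_iters, warm_iters = [], []
cold_init_res, warm_init_res = [], []
v_prev = np.zeros_like(y)
for ell in lengthscales:
    H = rbf_kernel(x, ell) + noise * np.eye(len(x))
    # Initial residual norms: zero initialisation versus previous solution.
    cold_init_res.append(np.linalg.norm(y))
    warm_init_res.append(np.linalg.norm(y - H @ v_prev))
    _, n_cold = conjugate_gradients(H, y, np.zeros_like(y))
    v_prev, n_warm = conjugate_gradients(H, y, v_prev)
    cold_iters.append(n_cold)
    warm_iters.append(n_warm)

print("cold iterations:", cold_iters)
print("warm iterations:", warm_iters)
```

Because the hyperparameters move only slightly between outer steps, the previous solution starts much closer to the new one than zero does, so the warm-started solver typically needs fewer iterations. Capping `max_iters` at a small value would mimic the early-stopping setting in (iii), where solver progress carries over from one step to the next.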

Paper Structure (39 sections, 7 theorems, 48 equations, 16 figures, 10 tables, 3 algorithms)

Introduction
Gaussian Process Regression and Marginal Likelihood Optimisation
Pathwise Conditioning
The Marginal Likelihood and Its Gradient
Hierarchical View of Marginal Likelihood Optimisation for Iterative Gaussian Processes
Outer-Loop Optimiser
Gradient Estimator
Linear System Solver
Pathwise Estimation of Marginal Likelihood Gradients
Initial Distance to the Linear System Solution
Amortising Linear Solves for Optimisation and Prediction
How Many Probe Vectors and Posterior Samples Do We Need?
Estimator Variance
Approximate Prior Function Samples Using Random Features
Warm Starting Linear System Solvers
...and 24 more sections

Key Result

theorem 1

(informal) Under reasonable assumptions, the marginal likelihood $\mathcal{L}$ of the hyperparameters obtained by maximising the objective implied by the warm-started gradients $\tilde{\bm{\theta}}^*$ will converge in probability to the marginal likelihood of a true maximum $\bm{\theta}^*$: $\mathca

Figures (16)

Figure 1: Comparison of relative runtimes for different methods, linear system solvers, and datasets. The linear system solver (hatched areas) dominates the total training time (coloured patches). The pathwise gradient estimator requires less time than the standard estimator. Initialising at the previous solution (warm start) further reduces the runtime of the linear system solver for both estimators.
Figure 2: Marginal likelihood optimisation for iterative GPs.
Figure 3: On the pol and elevators datasets, the pathwise estimator results in a lower RKHS distance (\ref{eq:RKHS_distance}) between solver initialisation and solution, as predicted by theory (\ref{eq:standard_dist}, \ref{eq:pathwise_dist}) (left). This results in fewer AP iterations until reaching the tolerance (left middle). When using the standard estimator, the initial distance follows the top eigenvalue of $\mathbf{H}_{\bm{\theta}}^{-1}$ (right middle), which is strongly related to the noise precision (right). The latter tends to increase during marginal likelihood optimisation when fitting the data. The effects are greater on pol due to the higher noise precision.
Figure 4: On the pol dataset, increasing the number of posterior samples improves the performance of pathwise conditioning until diminishing returns start to manifest with more than 64 samples (left). Furthermore, with $4\times$ as many probe vectors, the total cumulative runtime only increases by around 10% because the computational costs are dominated by shared kernel function evaluations (right).
Figure 5: Across all datasets and marginal likelihood steps, most hyperparameter trajectories of the pathwise estimator rarely differ from exact optimisation, as shown by the histogram illustrating the differences between hyperparameters (left). On selected length scales of the elevators dataset, the pathwise estimator deviates due to the use of random features to approximate prior function samples. With exact samples from the prior, the pathwise estimator matches exact optimisation again (right).
...and 11 more figures
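The initial-distance behaviour described in the Figure 3 caption can be sketched numerically. The equations themselves are not reproduced on this page, so the sketch below rests on labelled assumptions: standard probe vectors are drawn as $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, pathwise probe vectors as $\bm{\xi} \sim \mathcal{N}(\mathbf{0}, \mathbf{H}_{\bm{\theta}})$, and the distance from a zero initialisation to the solution is measured as $\mathbf{b}^\top \mathbf{H}_{\bm{\theta}}^{-1} \mathbf{b}$ for right-hand side $\mathbf{b}$. The kernel, grid, and function name `initial_distances` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = np.linspace(-3, 3, n)
# Illustrative squared-exponential kernel with unit length scale.
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

def initial_distances(noise_variance, num_probes=256):
    # Monte Carlo estimate of the average distance b^T H^{-1} b between a
    # zero initialisation and the solution of H v = b, for both probe types.
    H = K + noise_variance * np.eye(n)
    L = np.linalg.cholesky(H)
    z = rng.standard_normal((n, num_probes))       # assumed standard probes, z ~ N(0, I)
    xi = L @ rng.standard_normal((n, num_probes))  # assumed pathwise probes, xi ~ N(0, H)
    d_standard = np.sum(z * np.linalg.solve(H, z), axis=0).mean()
    d_pathwise = np.sum(xi * np.linalg.solve(H, xi), axis=0).mean()
    return d_standard, d_pathwise

for noise_variance in (1.0, 0.1, 0.01):
    d_std, d_path = initial_distances(noise_variance)
    print(f"noise variance {noise_variance}: standard {d_std:.1f}, pathwise {d_path:.1f}")
```

Under these assumptions the standard distance has expectation $\mathrm{tr}(\mathbf{H}_{\bm{\theta}}^{-1})$, which grows as the noise variance shrinks (the noise precision rises), while the pathwise distance has expectation $\mathrm{tr}(\mathbf{H}_{\bm{\theta}}^{-1} \mathbf{H}_{\bm{\theta}}) = n$ regardless of the noise level.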

Theorems & Definitions (15)

theorem 1
definition 1: Sub-gaussian norm
definition 2: Sub-exponential norm
theorem 2
lemma 1: Computing the operator norm on a net (Vershynin, 2018)
lemma 2: Size of $\epsilon$-net on $\mathcal{S}^{n-1}$ (Vershynin, 2018)
lemma 3
proof
lemma 4
proof
...and 5 more
