Warm Start Marginal Likelihood Optimisation for Iterative Gaussian Processes
Jihao Andreas Lin, Shreyas Padhy, Bruno Mlodozeniec, José Miguel Hernández-Lobato
TL;DR
The paper tackles the computational bottleneck of learning Gaussian process hyperparameters by maximising the marginal likelihood on large datasets, where exact Cholesky factorisation is costly. It proposes an iterative GP framework with a three-level hierarchy of marginal likelihood optimisation and a warm-start amortisation scheme that reuses the linear system solutions from one step to initialise the solver in the next. The authors derive a bound showing that, with enough trace samples, the gradient bias introduced by fixed probe vectors and Taylor-based warm starts still yields a $\gamma$-close optimum with high probability. Empirical evaluation on five UCI regression datasets shows that warm starts match the conventional procedure in predictive log-likelihood while delivering up to a $16 \times$ average speed-up in total runtime, enabling scalable iterative GP training with minimal performance loss.
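For orientation, the quantities behind "trace samples" and "probe vectors" can be written out using the standard marginal likelihood gradient for GP regression. The notation here is an assumption made for illustration and may differ from the paper's: $H_\theta = K_\theta + \sigma^2 I$ is the regularised kernel matrix, $y$ the training targets, and $z_1, \dots, z_s$ random probe vectors with $\mathbb{E}[z_j z_j^\top] = I$.

$$
\frac{\partial}{\partial \theta_k} \log p(y \mid \theta)
= \frac{1}{2} \left(H_\theta^{-1} y\right)^\top \frac{\partial H_\theta}{\partial \theta_k} \left(H_\theta^{-1} y\right)
- \frac{1}{2} \operatorname{tr}\!\left(H_\theta^{-1} \frac{\partial H_\theta}{\partial \theta_k}\right),
\qquad
\operatorname{tr}\!\left(H_\theta^{-1} \frac{\partial H_\theta}{\partial \theta_k}\right)
\approx \frac{1}{s} \sum_{j=1}^{s} z_j^\top H_\theta^{-1} \frac{\partial H_\theta}{\partial \theta_k} z_j .
$$

Under this formulation, each gradient evaluation requires solving the batch of linear systems $H_\theta \, [v_y, v_1, \dots, v_s] = [y, z_1, \dots, z_s]$. Keeping the probe vectors $z_j$ fixed between steps means the right-hand sides stay the same, so the previous solutions remain meaningful initialisations for the next solve.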
Abstract
Gaussian processes are a versatile probabilistic machine learning model whose effectiveness often depends on good hyperparameters, which are typically learned by maximising the marginal likelihood. In this work, we consider iterative methods, which use iterative linear system solvers to approximate marginal likelihood gradients up to a specified numerical precision, allowing a trade-off between compute time and accuracy of a solution. We introduce a three-level hierarchy of marginal likelihood optimisation for iterative Gaussian processes, and identify that the computational costs are dominated by solving sequential batches of large positive-definite systems of linear equations. We then propose to amortise computations by reusing solutions of linear system solvers as initialisations in the next step, providing a $\textit{warm start}$. Finally, we discuss the necessary conditions and quantify the consequences of warm starts and demonstrate their effectiveness on regression tasks, where warm starts achieve the same results as the conventional procedure while providing up to a $16 \times$ average speed-up among datasets.
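The warm-start idea can be illustrated with a minimal sketch. It assumes a plain conjugate gradient solver, a one-dimensional RBF kernel, and a small lengthscale change standing in for one optimiser step. The function names, toy data, and hyperparameter values are hypothetical and are not taken from the paper, which may use different solvers and settings.

```python
import numpy as np


def rbf_kernel(x, lengthscale):
    """Squared-exponential kernel matrix for 1-D inputs."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def conjugate_gradients(A, b, x0, tol=1e-8, max_iters=1000):
    """Solve A x = b for symmetric positive-definite A, starting from x0.

    Stops when the residual norm falls below tol * ||b|| and returns the
    solution together with the number of iterations performed.
    """
    x = x0.copy()
    r = b - A @ x          # initial residual depends on the starting point
    p = r.copy()
    rs = r @ r
    threshold = tol * np.linalg.norm(b)
    for k in range(max_iters):
        if np.sqrt(rs) <= threshold:
            return x, k
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iters


# Toy regression data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)
y = np.sin(2.0 * x) + 0.1 * rng.standard_normal(x.size)
noise_var = 0.1

# Two consecutive hyperparameter settings that differ slightly,
# mimicking one small step of a marginal likelihood optimiser.
A_old = rbf_kernel(x, 1.00) + noise_var * np.eye(x.size)
A_new = rbf_kernel(x, 1.01) + noise_var * np.eye(x.size)

# Solve the previous system, then solve the new one twice:
# once from zeros (cold start) and once from the previous solution (warm start).
v_old, _ = conjugate_gradients(A_old, y, np.zeros_like(y))
v_cold, cold_iters = conjugate_gradients(A_new, y, np.zeros_like(y))
v_warm, warm_iters = conjugate_gradients(A_new, y, v_old)

print(f"cold start: {cold_iters} iterations, warm start: {warm_iters} iterations")
```

Both runs solve the same system to the same tolerance. The only difference is the initial residual, which is already small when the solver starts from the previous solution. In this toy setting the saving per step is modest; the speed-ups reported in the abstract are measured over a full optimisation run on much larger systems.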
