Warm Start Marginal Likelihood Optimisation for Iterative Gaussian Processes

Jihao Andreas Lin, Shreyas Padhy, Bruno Mlodozeniec, José Miguel Hernández-Lobato

TL;DR

The paper tackles the computational bottleneck of learning Gaussian process hyperparameters by maximising the marginal likelihood on large datasets, where exact Cholesky factorisation is costly. It proposes an iterative GP framework with a three-level hierarchy of marginal likelihood optimisation and a warm-start amortisation scheme that reuses each linear solver solution to initialise the solver at the next step. The authors derive a bound showing that, with enough trace samples, optimisation with the biased gradient that results from fixed probe vectors and Taylor-based warm starts still reaches a $\gamma$-close optimum with high probability. Empirical evaluation on five UCI regression datasets shows that warm starts match the predictive log-likelihood obtained with exact gradients while delivering up to $16 \times$ speed-ups in total runtime, enabling scalable iterative GP training with minimal performance loss.

Abstract

Gaussian processes are a versatile probabilistic machine learning model whose effectiveness often depends on good hyperparameters, which are typically learned by maximising the marginal likelihood. In this work, we consider iterative methods, which use iterative linear system solvers to approximate marginal likelihood gradients up to a specified numerical precision, allowing a trade-off between compute time and accuracy of a solution. We introduce a three-level hierarchy of marginal likelihood optimisation for iterative Gaussian processes, and identify that the computational costs are dominated by solving sequential batches of large positive-definite systems of linear equations. We then propose to amortise computations by reusing solutions of linear system solvers as initialisations in the next step, providing a "warm start". Finally, we discuss the necessary conditions and quantify the consequences of warm starts and demonstrate their effectiveness on regression tasks, where warm starts achieve the same results as the conventional procedure while providing up to a $16 \times$ average speed-up among datasets.
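The warm-start idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the squared-exponential kernel, the toy data, the noise level, the hand-written conjugate gradients routine and all names (`rbf_kernel`, `conjugate_gradients`, `H_old`, `H_new`) are assumptions made for illustration. The sketch solves the system at one lengthscale, then solves the system at a slightly changed lengthscale once from zero and once from the previous solution.

```python
import numpy as np

def rbf_kernel(x, lengthscale):
    """Squared-exponential kernel matrix for 1-D inputs x (illustrative choice)."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def conjugate_gradients(H, b, x0, tol=1e-6, max_iters=10_000):
    """Solve H x = b for symmetric positive-definite H, starting from x0.

    Stops when the residual norm falls below tol * ||b|| and returns the
    solution together with the number of iterations used.
    """
    x = x0.copy()
    r = b - H @ x            # residual at the initialisation
    p = r.copy()             # first search direction
    rs = r @ r
    b_norm = np.linalg.norm(b)
    for it in range(max_iters):
        if np.sqrt(rs) <= tol * b_norm:
            return x, it
        Hp = H @ p
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iters

# Toy regression data (hypothetical, not one of the paper's datasets)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=200))
y = np.sin(2 * x) + 0.1 * rng.standard_normal(200)
noise = 0.1

# Two consecutive hyperparameter settings, as after one small optimiser step
H_old = rbf_kernel(x, 0.50) + noise * np.eye(200)
H_new = rbf_kernel(x, 0.52) + noise * np.eye(200)

v_old, _ = conjugate_gradients(H_old, y, np.zeros(200))
v_cold, iters_cold = conjugate_gradients(H_new, y, np.zeros(200))  # initialise at zero
v_warm, iters_warm = conjugate_gradients(H_new, y, v_old)          # warm start

print("iterations from zero:", iters_cold)
print("iterations from previous solution:", iters_warm)
```

Both runs target the same solution; only the starting point differs. When the hyperparameters change little between steps, the previous solution already leaves a small residual, which is the effect the warm start exploits.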

Paper Structure (12 sections, 7 theorems, 24 equations, 9 figures, 2 tables)

Key Result

theorem 1

Let $\mathcal{L}$ and $\nabla \mathcal{L}$ be the marginal likelihood and its gradient as defined in eq:mll and eq:mll_grad respectively, and let $\tilde{\bm{g}}$ be an approximation to the gradient $\nabla \mathcal{L}$ where the trace is approximated with $s$ fixed samples as in eq:trace. Assume th [...] with probability at least $1-\delta$.
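A trace approximation with $s$ fixed samples can be sketched as follows. The equations the theorem refers to are not reproduced on this page, so the sketch assumes the standard stochastic estimator $\mathrm{tr}(H^{-1} \partial H) \approx \frac{1}{s} \sum_{j} z_j^\top H^{-1} \partial H z_j$ with Rademacher probe vectors $z_j$ that are drawn once and then reused. The kernel, the data and the names `trace_term`, `H`, `dH` and `probes` are hypothetical, and a dense solve stands in for the iterative solver.

```python
import numpy as np

def trace_term(H, dH, probes):
    """Estimate tr(H^{-1} dH) from fixed probe vectors (the columns of probes).

    For symmetric H, z^T H^{-1} dH z equals (H^{-1} z)^T (dH z), so one batch
    of linear solves against the probes is enough.
    """
    solves = np.linalg.solve(H, probes)   # stand-in for an iterative solver
    return np.mean(np.sum(solves * (dH @ probes), axis=0))

rng = np.random.default_rng(0)
n, s = 150, 64
x = np.sort(rng.uniform(-3, 3, size=n))
sq_dists = (x[:, None] - x[None, :]) ** 2
lengthscale, noise = 0.5, 0.1             # illustrative hyperparameters

K = np.exp(-0.5 * sq_dists / lengthscale**2)
H = K + noise * np.eye(n)
dH = K * sq_dists / lengthscale**3        # derivative of H w.r.t. the lengthscale

# Probe vectors are sampled once and kept fixed across optimisation steps
probes = rng.choice([-1.0, 1.0], size=(n, s))

estimate = trace_term(H, dH, probes)
exact = np.trace(np.linalg.solve(H, dH))
print("estimate:", estimate, "exact:", exact)
```

Keeping the probes fixed makes the estimate a deterministic function of the hyperparameters, at the cost of a bias that does not average out over optimisation steps; that bias is the quantity the theorem is concerned with.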

Figures (9)

  • Figure 1: Two-dimensional cross-sections of quadratic objectives targeted by linear solvers after twenty marginal likelihood steps on the pol dataset, centred at the solution and visualised along eigendirections corresponding to the two largest eigenvalues (left), and evolution of the distance between initialisation and solution measured as root-mean-square error with respect to the norm induced by the curvature of the quadratic objective (right). Initialising at the previous solution (warm start) substantially reduces the initial distance to the solution.
  • Figure 2: Marginal likelihood optimisation framework for iterative Gaussian processes.
  • Figure 3: Comparison of relative runtimes for different linear system solvers. The solver (striped areas) dominates the total training time (coloured patches). Initialising at the previous solution (warm start) significantly reduces the runtime of the linear system solver, with varying effectiveness among different solvers and datasets.
  • Figure 4: Evolution of the required number of linear system solver iterations at each step of marginal likelihood optimisation on the pol dataset. Initialising at the solution of the previous step (warm start) reduces the number of required solver iterations with varying effectiveness among different solvers.
  • Figure 5: Evolution of hyperparameters during marginal likelihood optimisation on the pol dataset using conjugate gradients as linear system solver. The behaviour of exact gradient computation using Cholesky factorisation is obtained when initialising at zero or at the previous solution. The latter does not degrade performance.
  • ...and 4 more figures
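The distance plotted in Figure 1 (right) can be illustrated with a short sketch. The exact normalisation used in the paper is not stated on this page, so the sketch assumes the squared norm induced by the system matrix $H$, divided by the number of entries, followed by a square root. The function name `curvature_rmse`, the random test matrix and the perturbed initialisation are hypothetical.

```python
import numpy as np

def curvature_rmse(H, v_init, v_star):
    """Root-mean-square distance between v_init and v_star in the H-induced norm."""
    diff = v_init - v_star
    return np.sqrt(diff @ H @ diff / diff.size)

rng = np.random.default_rng(0)
n = 50
B = rng.standard_normal((n, n))
H = B @ B.T / n + np.eye(n)               # symmetric positive-definite test matrix
b = rng.standard_normal(n)
v_star = np.linalg.solve(H, b)            # solution of H v = b

# Initialising at zero versus at a point near the solution
# (the small perturbation mimics a previous solution and is purely illustrative)
cold = curvature_rmse(H, np.zeros(n), v_star)
warm = curvature_rmse(H, v_star + 0.01 * rng.standard_normal(n), v_star)
print("distance from zero:", cold)
print("distance from nearby point:", warm)
```

With the identity in place of $H$, this quantity reduces to the ordinary root-mean-square error between the two vectors.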

Theorems & Definitions (12)

  • theorem 1
  • theorem 2
  • proof
  • lemma 1
  • proof
  • lemma 2
  • proof
  • lemma 3
  • proof
  • theorem 3
  • ...and 2 more