On the Laplace Approximation as Model Selection Criterion for Gaussian Processes

Andreas Besginow; Jan David Hüwel; Thomas Pawellek; Christian Beecks; Markus Lange-Hegermann

TL;DR

This work focuses on evaluating the performance of Gaussian process models, i.e. finding a metric that provides the best trade-off between accuracy, interpretability and simplicity, and introduces multiple metrics based on the Laplace approximation.

Abstract

Model selection aims to find the best model in terms of accuracy, interpretability or simplicity, preferably all at once. In this work, we focus on evaluating the performance of Gaussian process models, i.e. finding a metric that provides the best trade-off between all those criteria. Previous work considers metrics like the likelihood, AIC or dynamic nested sampling, but these either lack performance or have significant runtime issues, which severely limits their applicability. We address these challenges by introducing multiple metrics based on the Laplace approximation, where we overcome a severe inconsistency that occurs during naive application of the Laplace approximation. Experiments show that our metrics are comparable in quality to the gold standard, dynamic nested sampling, without compromising on computational speed. Our model selection criteria allow significantly faster, high-quality model selection of Gaussian process models.

Paper Structure (24 sections, 4 theorems, 10 equations, 5 figures, 5 tables)

Introduction
Preliminaries
Gaussian Processes
Model Selection for GPs
Kernel Search Algorithms
Laplace approximation of GP model evidence
Overcoming inconsistencies of the Laplace approximation
Evaluation
Interpretable example
Kernel search experiments
Real world dataset
Conclusion
Additional experiment details
Interpretable experiment
Kernel search experiment
...and 9 more sections

Key Result

Lemma 3.1

To ensure that each hyperparameter of a GP has a minimal negative contribution $r$ in the last two summands of formula eq:gp_likelihood_prior_laplace_approximation to the model evidence $\mathcal{Z}$, every eigenvalue $\lambda$ of the Hessian needs to be at least:
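
As an illustration only, suppose the referenced formula has the standard Laplace form $\log \mathcal{Z} \approx \log p(y \mid \hat\theta) + \log p(\hat\theta) + \frac{u}{2}\log 2\pi - \frac{1}{2}\log|H|$, where $H$ is the Hessian of the negative log posterior at the optimum $\hat\theta$ and $u$ is the number of hyperparameters. This form is an assumption, since the formula is not shown in this summary. Under it, the last two summands split into one term $\frac{1}{2}\log(2\pi/\lambda_i)$ per eigenvalue, and requiring each term to be at most $r$ gives $\lambda_i \ge 2\pi\exp(-2r)$. The sketch below floors the eigenvalues at that value; the function name and arguments are hypothetical and not taken from the paper.

```python
import numpy as np

def stabilized_laplace_log_evidence(log_joint_at_opt, hessian, r=0.0):
    """Sketch of a Laplace log-evidence estimate with floored eigenvalues.

    log_joint_at_opt: log-likelihood plus log-prior at the optimum (assumed given)
    hessian: symmetric Hessian of the negative log joint at the optimum
    r: largest contribution any single hyperparameter may add (assumed form)
    """
    hessian = np.asarray(hessian, dtype=float)
    # Eigenvalues of a symmetric matrix, returned in ascending order.
    eigvals = np.linalg.eigvalsh(hessian)
    # Floor derived from 0.5 * log(2*pi / lambda) <= r.
    floor = 2.0 * np.pi * np.exp(-2.0 * r)
    clipped = np.maximum(eigvals, floor)
    # Per-hyperparameter contribution of the last two summands.
    contributions = 0.5 * np.log(2.0 * np.pi) - 0.5 * np.log(clipped)
    return log_joint_at_opt + contributions.sum(), contributions

# A nearly flat direction (eigenvalue 1e-12) would otherwise add a large
# positive term; with the floor its contribution is capped at r.
log_z, contrib = stabilized_laplace_log_evidence(-10.0, np.diag([1e-12, 100.0]), r=0.0)
```

With a nearly degenerate direction, the unfloored term $\frac{1}{2}\log(2\pi/\lambda)$ grows without bound as $\lambda \to 0$, which is the inconsistency the stabilized variants are meant to avoid.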

Figures (5)

Figure 1: A conceptual visualization of the inconsistency when naively applying the Laplace approximation and one of our suggested variants. Left: The posterior over parametrizations $\theta$, with a degenerate local extremum (red dot). The model evidence $\mathcal{Z}$ is the gray shaded area. Middle: Naive application of the Laplace approximation around the optimum with infinitely large model evidence approximation $\mathcal{Z}_{\text{Lap}} \approx \infty$ overlaid in red. Right: Application of our stabilized Laplace ($\text{Lap}_0$) around the optimum with model evidence approximation $\mathcal{Z}_{\text{Lap}_0} \approx \mathcal{Z}$ overlaid in green.
Figure 2: Results of nested sampling for the linear noisy dataset in the "Interpretable example" section (showing 1024 representative samples out of 12,255 total samples); higher values are better. The differently colored ellipses show the $2\sigma$ confidence ellipses for the normal distributions associated with the Laplace approximations for $\text{Lap}_0$ (black), $\text{Lap}_A$ (brown), $\text{Lap}_B$ (purple) and the standard Laplace approximation (blue). The dotted gray ellipse is the $2\sigma$ confidence ellipse for the hyperparameter prior. The $\times$ shows the optimum found during nested sampling. We see that the ellipses derived from our versions of the Laplace approximation give a good approximation of the most relevant area of the likelihood surface, since they cover the majority of the nested sampling samples.
Figure 3: The $\pm$ one standard deviation, between the log model evidence and the respective metric's value, across varying dataset sizes. and have been rescaled by $-0.5$ to have the same scale as the model evidence. Smaller is better. Our variants of the Laplace approximation are drawn in bold.
Figure 4: Time to calculate each metric, averaged over all kernels evaluated during the kernel searches, on a logarithmic scale. For our metrics, the variants of the Laplace approximation, the time includes the training procedure with two random restarts. This shows that the computation time of our approaches is two orders of magnitude smaller than that of dynamic nested sampling.
Figure 5: The dataset used in the first experiment. Ten datapoints $y = x + \sigma_n$ at evenly distributed locations $x = 0\ldots 1$ with $\sigma_n \sim \mathcal{N}(0, 0.1)$

Theorems & Definitions (5)

Lemma 3.1
Corollary 3.2
Corollary 3.3
Corollary 3.4
proof