Online Laplace Model Selection Revisited

Jihao Andreas Lin; Javier Antorán; José Miguel Hernández-Lobato

TL;DR

This paper revisits online Laplace (OL) model selection for neural networks (NNs), addressing the concern that OL updates violate the stationarity assumption of Laplace's method. It shows that retaining the first-order Taylor term yields a tangent linear model whose evidence provides a principled objective for online hyperparameter tuning, and that OL can be interpreted as maximising a variational lower bound on this tangent-model evidence. At convergence, the NN parameters coincide with the tangent-model MAP and the tangent-model evidence matches the standard Laplace evidence, which explains the observed effectiveness of online tuning. Experiments on UCI regression datasets show that online hyperparameter optimisation yields better generalisation and predictive log-likelihood than offline methods, with OL often delivering the best results and an improved RMSE due to a more favourable bias-variance balance.
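The role of the first-order Taylor term can be illustrated with a minimal one-dimensional sketch. The function `f`, its derivatives, and the expansion point `x0` below are hypothetical choices made only for this illustration and are not taken from the paper. The sketch compares second-order expansions with and without the first-order term at a point that is not stationary, and computes the minimiser of the full quadratic, which plays a role loosely analogous to the tangent-model MAP.

```python
def f(x):
    # Hypothetical 1-D loss, chosen only for illustration.
    return 0.5 * x**2 + 0.1 * x**4

def grad(x):
    # First derivative of f.
    return x + 0.4 * x**3

def hess(x):
    # Second derivative of f.
    return 1.0 + 1.2 * x**2

def taylor2(x, x0, include_first_order=True):
    """Second-order Taylor expansion of f around x0.

    Dropping the first-order term is only justified when grad(x0) == 0,
    i.e. when x0 is a stationary point.
    """
    d = x - x0
    first = grad(x0) * d if include_first_order else 0.0
    return f(x0) + first + 0.5 * hess(x0) * d**2

x0 = 1.0          # not a stationary point, since grad(1.0) = 1.4
x = x0 + 0.1      # nearby evaluation point

err_with = abs(f(x) - taylor2(x, x0, include_first_order=True))
err_without = abs(f(x) - taylor2(x, x0, include_first_order=False))

# Minimiser of the quadratic that keeps the first-order term (a Newton step).
# It differs from x0 whenever grad(x0) != 0 and equals x0 at a stationary point.
x_star = x0 - grad(x0) / hess(x0)

print(f"error with first-order term:    {err_with:.5f}")
print(f"error without first-order term: {err_without:.5f}")
print(f"minimiser of the full quadratic: {x_star:.4f}")
```

At a stationary point such as `x0 = 0.0` the two expansions are identical and `x_star` equals `x0`, which mirrors the statement that the two objectives agree at convergence.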

Abstract

The Laplace approximation provides a closed-form model selection objective for neural networks (NN). Online variants, which optimise NN parameters jointly with hyperparameters such as weight decay strength, have seen renewed interest in the Bayesian deep learning community. However, these methods violate the critical assumption of Laplace's method that the approximation is performed around a mode of the loss, calling into question their soundness. This work re-derives online Laplace methods, showing that they target a variational bound on a mode-corrected variant of the Laplace evidence which does not make stationarity assumptions. Online Laplace and its mode-corrected counterpart share stationary points where (1) the NN parameters are a maximum a posteriori estimate, satisfying the assumption of Laplace's method, and (2) the hyperparameters maximise the Laplace evidence, motivating online methods. We demonstrate that these optima are roughly attained in practice by online algorithms using full-batch gradient descent on UCI regression datasets. The optimised hyperparameters prevent overfitting and outperform validation-based early stopping.
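To make the idea of evidence-based hyperparameter selection concrete, the following is a minimal sketch for the simplest possible case: a one-weight linear model with a Gaussian prior and Gaussian noise, where the Laplace approximation is exact. This is not the paper's online algorithm. It selects the prior precision `alpha` (the weight decay strength) by an offline grid search over the closed-form log evidence. The data, the noise precision `beta`, the grid, and the function name `log_evidence` are all assumptions made for this illustration.

```python
import math

def log_evidence(xs, ys, alpha, beta):
    """Log marginal likelihood of y = w * x + noise.

    Assumes w ~ N(0, 1/alpha) and noise ~ N(0, 1/beta). Because the model
    is linear in w, the Laplace evidence equals the exact evidence here.
    """
    n = len(xs)
    # Posterior precision: Hessian of the negative log joint with respect to w.
    a = alpha + beta * sum(x * x for x in xs)
    # Posterior mean, which is also the MAP estimate of w.
    m = beta * sum(x * y for x, y in zip(xs, ys)) / a
    # Negative log joint at the MAP, up to normalising constants.
    e = (0.5 * beta * sum((y - m * x) ** 2 for x, y in zip(xs, ys))
         + 0.5 * alpha * m * m)
    return (0.5 * math.log(alpha) + 0.5 * n * math.log(beta)
            - e - 0.5 * math.log(a) - 0.5 * n * math.log(2 * math.pi))

# Made-up toy data, for illustration only.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-1.9, -1.2, 0.1, 0.8, 2.1]
beta = 25.0  # assumed known noise precision

# Offline grid search over the prior precision (weight decay strength).
alphas = [10.0 ** k for k in range(-4, 5)]
best_alpha = max(alphas, key=lambda a: log_evidence(xs, ys, a, beta))
print("prior precision with highest log evidence:", best_alpha)
```

In the online setting described above, the grid search would be replaced by hyperparameter updates interleaved with NN parameter updates, and the Hessian would be evaluated at the current weights rather than at a mode.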

Paper Structure (16 sections, 7 equations, 15 figures)

Introduction
Preliminaries: Laplace approximation and online variants
Online hyperparameter tuning
Understanding online Laplace through the tangent model
Online hyperparameter optimisation including the first-order term
Online hyperparameter optimisation with the tangent linear model
Online Laplace as a variational bound on the tangent model's evidence
Convergence behaviour
Experiments
Convergence of NN and linear model posteriors
Predictive performance
Evolution of neural network training
Additional predictive metrics
Related work
MacKay's hyperparameter update
...and 1 more section

Figures (15)

Figure 1: Second-order Taylor approximations of a function $f$ (black) around $x_0$ (star), with (orange) and without (blue) the first-order term.
Figure 2: Illustration of exact linear model evidence, and LM and OL hyperparameter objectives at a single train step $t$ (left). The latter two represent lower bounds as per \ref{eq:amazing_bound}. $\mathcal{L}_h$ is tighter, which leads to larger updates, making hyperparameter trajectories unstable (right).
Figure 3: Evolution of neural network weights $w_t$ and tangent linear model posterior mean $v^\star$ during training. With online Laplace (OL procedure), $w_t$ and $v^\star$ converge to the same distribution. Offline training does not exhibit this behaviour.
Figure 4: Difference between neural network weights $w_t$ and linear model posterior mean $v^\star$ (left) and test RMSE versus ELBO ${\mathcal{L}}$ (right) throughout online training for both OL and LM procedures. A maximised ELBO does not imply optimal test RMSE.
Figure 5: Test log-likelihood on UCI regression (mean $\pm$ standard error over 10 splits).
...and 10 more figures
