Online Laplace Model Selection Revisited

Jihao Andreas Lin, Javier Antorán, José Miguel Hernández-Lobato

TL;DR

This paper reconsiders online Laplace (OL) model selection for neural networks (NNs), addressing the concern that OL updates violate the stationarity assumption of Laplace's method. It shows that retaining the first-order Taylor term leads to a tangent linear model whose evidence provides a principled objective for online hyperparameter tuning, and that OL can be interpreted as maximising a variational bound on this tangent-model evidence. At convergence, the NN parameters and the tangent-model MAP coincide, and the tangent-model evidence matches the standard Laplace evidence, which explains the observed effectiveness of online tuning. Experiments on UCI regression show that online hyperparameter optimisation yields better generalisation and predictive log-likelihood than offline methods, with OL often delivering the best results and improved RMSE due to a more favourable bias-variance balance.

Abstract

The Laplace approximation provides a closed-form model selection objective for neural networks (NNs). Online variants, which optimise NN parameters jointly with hyperparameters such as weight decay strength, have seen renewed interest in the Bayesian deep learning community. However, these methods violate the critical assumption of Laplace's method that the approximation is performed around a mode of the loss, calling into question their soundness. This work re-derives online Laplace methods, showing them to target a variational bound on a mode-corrected variant of the Laplace evidence which does not make stationarity assumptions. Online Laplace and its mode-corrected counterpart share stationary points where (1) the NN parameters are a maximum a posteriori (MAP) estimate, satisfying the assumption of Laplace's method, and (2) the hyperparameters maximise the Laplace evidence, motivating online methods. We demonstrate that these optima are roughly attained in practice by online algorithms using full-batch gradient descent on UCI regression datasets. The optimised hyperparameters prevent overfitting and outperform validation-based early stopping.
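
To illustrate why the first-order term matters when the expansion point is not a mode, the following is a minimal NumPy sketch, not the paper's implementation. It approximates $\log \int \exp(-\ell(w))\,dw$ from a second-order Taylor expansion of a loss $\ell$ at a point $w_0$, with and without the gradient term. The function name `laplace_log_evidence`, the toy quadratic loss, and all variable names are assumptions made for this illustration. Writing $g$ for the gradient and $H$ for the (positive definite) Hessian at $w_0$, keeping the first-order term adds $\tfrac{1}{2} g^\top H^{-1} g$ to the standard Laplace estimate, and the correction vanishes when $g = 0$.

```python
import numpy as np


def laplace_log_evidence(loss, grad, hess, w0, mode_corrected):
    """Estimate log of the integral of exp(-loss(w)) dw via a second-order Taylor expansion at w0.

    With mode_corrected=False the first-order term is dropped, which is
    only justified when w0 is a stationary point of the loss.
    The Hessian at w0 is assumed to be positive definite.
    """
    w0 = np.asarray(w0, dtype=float)
    g = grad(w0)
    H = hess(w0)
    d = w0.size
    _, logdet = np.linalg.slogdet(H)
    # Standard Laplace estimate: Gaussian integral centred at w0.
    log_z = -loss(w0) + 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet
    if mode_corrected:
        # Completing the square with the gradient term adds 0.5 * g^T H^{-1} g.
        log_z += 0.5 * g @ np.linalg.solve(H, g)
    return log_z


# Hypothetical toy problem: a quadratic loss, for which the integral is known exactly.
A = np.array([[2.0, 0.5], [0.5, 1.0]])  # positive definite curvature
mode = np.array([1.0, -1.0])


def loss(w):
    return 0.5 * (w - mode) @ A @ (w - mode)


def grad(w):
    return A @ (w - mode)


def hess(w):
    return A


exact = 0.5 * 2 * np.log(2.0 * np.pi) - 0.5 * np.linalg.slogdet(A)[1]

w_off = mode + np.array([0.7, -0.3])  # an expansion point away from the mode
corrected = laplace_log_evidence(loss, grad, hess, w_off, mode_corrected=True)
standard = laplace_log_evidence(loss, grad, hess, w_off, mode_corrected=False)
at_mode = laplace_log_evidence(loss, grad, hess, mode, mode_corrected=False)

print(f"exact: {exact:.4f}")
print(f"corrected at w_off: {corrected:.4f}")
print(f"standard at w_off: {standard:.4f}")
print(f"standard at mode: {at_mode:.4f}")
```

For this quadratic toy loss the corrected estimate is exact at any expansion point, while the standard estimate is exact only at the mode and underestimates the integral elsewhere. For a neural network loss neither estimate is exact in general; the sketch only shows how the two estimates differ away from a stationary point and agree at one.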

Paper Structure (16 sections, 7 equations, 15 figures)


Figures (15)

  • Figure 1: Second-order Taylor approximations of a function $f$ (black) around $x_0$ (star), with (orange) and without (blue) the first-order term.
  • Figure 2: Illustration of exact linear model evidence, and LM and OL hyperparameter objectives at a single train step $t$ (left). The latter two represent lower bounds as per Eq. `eq:amazing_bound`. $\mathcal{L}_h$ is tighter, which leads to larger updates, making hyperparameter trajectories unstable (right).
  • Figure 3: Evolution of neural network weights $w_t$ and tangent linear model posterior mean $v^\star$ during training. With online Laplace (OL procedure), $w_t$ and $v^\star$ converge to the same distribution. Offline training does not exhibit this behaviour.
  • Figure 4: Difference between neural network weights $w_t$ and linear model posterior mean $v^\star$ (left) and test RMSE versus ELBO ${\mathcal{L}}$ (right) throughout online training for both OL and LM procedures. A maximised ELBO does not imply optimal test RMSE.
  • Figure 5: Test log-likelihood on UCI regression (mean $\pm$ standard error over 10 splits).
  • ...and 10 more figures