Online Laplace Model Selection Revisited
Jihao Andreas Lin, Javier Antorán, José Miguel Hernández-Lobato
TL;DR
This paper revisits online Laplace (OL) model selection for neural networks (NNs), addressing the concern that OL updates violate the stationarity assumption of Laplace's method. It shows that retaining the first-order Taylor term yields a tangent-linear model whose evidence provides a principled objective for online hyperparameter tuning, and that OL can be interpreted as maximising a variational bound on this tangent-model evidence. At convergence, the NN parameters coincide with the tangent-model maximum a posteriori (MAP) estimate and the tangent-model evidence matches the standard Laplace evidence, which explains the observed effectiveness of online tuning. Experiments on UCI regression datasets show that online hyperparameter optimisation yields better generalisation and predictive log-likelihood than offline methods, with OL often delivering the best results and improved RMSE due to a more favourable bias-variance balance.
Abstract
The Laplace approximation provides a closed-form model selection objective for neural networks (NNs). Online variants, which optimise NN parameters jointly with hyperparameters such as weight decay strength, have seen renewed interest in the Bayesian deep learning community. However, these methods violate the critical assumption of Laplace's method that the approximation is performed around a mode of the loss, calling their soundness into question. This work re-derives online Laplace methods, showing that they target a variational bound on a mode-corrected variant of the Laplace evidence which does not make stationarity assumptions. Online Laplace and its mode-corrected counterpart share stationary points where (1) the NN parameters are a maximum a posteriori (MAP) estimate, satisfying the assumption of Laplace's method, and (2) the hyperparameters maximise the Laplace evidence, motivating online methods. We demonstrate that these optima are approximately attained in practice by online algorithms using full-batch gradient descent on UCI regression datasets. The optimised hyperparameters prevent overfitting and outperform validation-based early stopping.
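To make the "closed-form model selection objective" concrete, the sketch below evaluates the Laplace evidence for a toy linear-Gaussian regression model and uses it to choose a prior precision (the weight decay strength). It is an illustration only, not the paper's online method or its NN experiments: the linear model, the synthetic data, the fixed noise precision, the grid search, and the name `laplace_log_evidence` are all assumptions introduced here. For a linear model the log joint is quadratic in the weights, so the Laplace approximation around the MAP estimate is exact.

```python
import numpy as np


def laplace_log_evidence(X, y, alpha, beta):
    """Laplace log evidence for y = X w + noise (hypothetical helper).

    alpha: precision of the zero-mean Gaussian prior on w (weight decay).
    beta:  precision of the Gaussian observation noise.
    """
    n, d = X.shape
    # Hessian of the negative log joint; constant in w for a linear model.
    H = beta * X.T @ X + alpha * np.eye(d)
    # MAP estimate, i.e. the mode around which Laplace's method expands.
    w_map = np.linalg.solve(H, beta * X.T @ y)
    resid = y - X @ w_map
    log_lik = 0.5 * n * np.log(beta / (2 * np.pi)) - 0.5 * beta * resid @ resid
    log_prior = 0.5 * d * np.log(alpha / (2 * np.pi)) - 0.5 * alpha * w_map @ w_map
    _, logdet_H = np.linalg.slogdet(H)
    # log p(y | w_map) + log p(w_map) + (d/2) log(2 pi) - (1/2) log det H
    return log_lik + log_prior + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet_H


# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

# Offline model selection: score each candidate prior precision at its own MAP.
alphas = np.logspace(-3, 3, 25)
scores = [laplace_log_evidence(X, y, a, beta=100.0) for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
```

Each candidate `alpha` is scored at its own MAP estimate, so the stationarity assumption holds by construction. Online variants instead update the hyperparameters while the NN parameters are still being trained, which is the situation the abstract refers to.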
