Temperature Optimization for Bayesian Deep Learning
Kenyon Ng, Chris van der Heide, Liam Hodgkinson, Susan Wei
TL;DR
This work addresses the cold posterior effect in Bayesian deep learning by proposing a data-driven method to select the tempering parameter $\beta$, treated as a model parameter, with the objective of maximizing the test log-predictive density (LPD). By reframing the tempered posterior as the posterior from a tempered model, the authors jointly optimize $\theta$ and $\beta$ via SGD, avoiding costly grid searches. The approach is evaluated on regression and classification tasks (including CIFAR-10 with and without data augmentation), showing performance comparable to grid search while substantially reducing computation time; TM-PD often yields higher test LPD for regression, with SM-PD sometimes favored in certain classification settings. The paper also contrasts Bayesian deep learning perspectives with Generalized Bayes, discussing model misspecification, calibration, and prior choices, and provides theoretical and empirical support for the proposed temperature selection method. Overall, the method offers a practical, data-driven tool for temperature tuning in tempered posteriors, with implications for predictive performance and uncertainty quantification in large-scale neural networks.
Abstract
The Cold Posterior Effect (CPE) is a phenomenon in Bayesian Deep Learning (BDL), where tempering the posterior to a cold temperature often improves the predictive performance of the posterior predictive distribution (PPD). Although the term 'CPE' suggests colder temperatures are inherently better, the BDL community increasingly recognizes that this is not always the case. Despite this, there remains no systematic method for finding the optimal temperature beyond grid search. In this work, we propose a data-driven approach to select the temperature that maximizes test log-predictive density, treating the temperature as a model parameter and estimating it directly from the data. We empirically demonstrate that our method performs comparably to grid search, at a fraction of the cost, across both regression and classification tasks. Finally, we highlight the differing perspectives on CPE between the BDL and Generalized Bayes communities: while the former primarily emphasizes the predictive performance of the PPD, the latter prioritizes the utility of the posterior under model misspecification; these distinct objectives lead to different temperature preferences.
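To make the idea of treating the temperature as a model parameter concrete, the following is a minimal sketch and not the authors' implementation. It assumes a one-dimensional linear model with a Gaussian likelihood of fixed scale $\sigma$; raising that likelihood to the power $\beta$ and renormalizing gives a Gaussian with variance $\sigma^2/\beta$, so the temperature $\beta$ can be fitted jointly with the slope by maximizing the tempered model's log-likelihood. The function name `fit_tempered_gaussian`, the toy data, the step size, and the use of full-batch gradient ascent on the training data (rather than minibatch SGD) are all illustrative assumptions.

```python
import math
import random


def fit_tempered_gaussian(xs, ys, sigma=1.0, lr=0.1, steps=3000):
    """Jointly fit a slope w and a temperature beta (hypothetical helper).

    Maximizes the average log-likelihood of the tempered Gaussian model
    N(y; w * x, sigma^2 / beta) by full-batch gradient ascent.
    """
    w, log_beta = 0.0, 0.0  # optimizing log(beta) keeps beta positive
    n = len(xs)
    for _ in range(steps):
        beta = math.exp(log_beta)
        resid = [y - w * x for x, y in zip(xs, ys)]
        mse = sum(r * r for r in resid) / n
        # Gradient of the mean log-likelihood with respect to w.
        grad_w = beta / sigma**2 * sum(r * x for r, x in zip(resid, xs)) / n
        # Gradient with respect to log(beta): 1/2 - beta * mse / (2 sigma^2).
        grad_log_beta = 0.5 - beta * mse / (2 * sigma**2)
        w += lr * grad_w
        log_beta += lr * grad_log_beta
    return w, math.exp(log_beta)


# Toy data (assumed): true slope 2, noise std 0.5, while the model fixes sigma = 1.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(500)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

w_hat, beta_hat = fit_tempered_gaussian(xs, ys)
print(f"w = {w_hat:.3f}, beta = {beta_hat:.3f}")
```

In this toy setting the model's assumed noise scale ($\sigma = 1$) is larger than the true one ($0.5$), so the fitted $\beta$ should come out greater than 1, close to $\sigma^2$ divided by the residual variance (roughly 4). That is a "cold" temperature found without any grid search, although here it is estimated on training data only, whereas the objective named in the abstract is the test log-predictive density.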
