Temperature Optimization for Bayesian Deep Learning

Kenyon Ng, Chris van der Heide, Liam Hodgkinson, Susan Wei

TL;DR

This work addresses the cold posterior effect in Bayesian deep learning by proposing a data-driven method to select the tempering parameter $\beta$, treated as a model parameter, with the objective of maximizing the test log-predictive density (LPD). By reframing the tempered posterior as the posterior from a tempered model, the authors jointly optimize $\theta$ and $\beta$ via SGD, avoiding costly grid searches. The approach is evaluated on regression and classification tasks (including CIFAR-10 with and without data augmentation), showing performance comparable to grid search while substantially reducing computation time; TM-PD often yields higher test LPD for regression, with SM-PD sometimes favored in certain classification settings. The paper also contrasts Bayesian deep learning perspectives with Generalized Bayes, discussing model misspecification, calibration, and prior choices, and provides theoretical and empirical support for the proposed temperature selection method. Overall, the method offers a practical, data-driven tool for temperature tuning in tempered posteriors, with implications for predictive performance and uncertainty quantification in large-scale neural networks.

Abstract

The Cold Posterior Effect (CPE) is a phenomenon in Bayesian Deep Learning (BDL), where tempering the posterior to a cold temperature often improves the predictive performance of the posterior predictive distribution (PPD). Although the term `CPE' suggests colder temperatures are inherently better, the BDL community increasingly recognizes that this is not always the case. Despite this, there remains no systematic method for finding the optimal temperature beyond grid search. In this work, we propose a data-driven approach to select the temperature that maximizes test log-predictive density, treating the temperature as a model parameter and estimating it directly from the data. We empirically demonstrate that our method performs comparably to grid search, at a fraction of the cost, across both regression and classification tasks. Finally, we highlight the differing perspectives on CPE between the BDL and Generalized Bayes communities: while the former primarily emphasizes the predictive performance of the PPD, the latter prioritizes the utility of the posterior under model misspecification; these distinct objectives lead to different temperature preferences.
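
To make the idea of "treating the temperature as a model parameter" concrete, here is a minimal sketch on a toy linear regression. It is an illustration under stated assumptions, not the authors' implementation: the tempered model is assumed to be the renormalized Gaussian $\mathcal{N}(y \mid x^{\top}\theta, \sigma^{2}/\beta)$, the prior is dropped (plain maximum likelihood), full-batch gradient descent stands in for SGD, and the function name `fit_temperature`, the learning rate, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_temperature(X, y, sigma=1.0, lr=0.05, steps=5000):
    """Jointly fit theta and beta by gradient descent on the mean negative
    log-likelihood of the assumed tempered model N(y | x^T theta, sigma^2 / beta)."""
    n, d = X.shape
    theta = np.zeros(d)
    rho = 0.0  # rho = log(beta), so beta stays positive
    for _ in range(steps):
        beta = np.exp(rho)
        resid = y - X @ theta
        mse = np.mean(resid ** 2)
        # Per-sample NLL: 0.5*log(2*pi*sigma^2) - 0.5*rho + beta*resid^2 / (2*sigma^2)
        grad_theta = -(beta / sigma**2) * (X.T @ resid) / n
        grad_rho = -0.5 + beta * mse / (2.0 * sigma**2)
        theta -= lr * grad_theta
        rho -= lr * grad_rho
    return theta, np.exp(rho)

# Toy data (hypothetical): the true noise std is 0.5, but the model assumes sigma = 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.5 * rng.normal(size=500)

theta_hat, beta_hat = fit_temperature(X, y, sigma=1.0)
print("estimated beta:", beta_hat)
```

In this toy setting the stationary point satisfies $\hat{\beta} = \sigma^{2} / \text{mean squared residual}$, so with the assumed $\sigma = 1$ and true noise variance $0.25$ the estimate should land roughly near 4, i.e. a "cold" temperature that compensates for the overly wide likelihood. No grid over $\beta$ is needed, which is the computational point the abstract makes.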

Paper Structure

This paper contains 68 sections, 3 theorems, 49 equations, 6 figures, 9 tables, 1 algorithm.

Key Result

Lemma 3.1

Consider a linear regression model $p(y \mid x, \theta) = \mathcal{N}(y \mid x^{\top} \theta, \sigma^{2})$ with a $d$-dimensional input $x$ and known variance $\sigma^{2}$, and a prior $p(\theta) = \mathcal{N}(\theta \mid 0, \sigma^{2}_{p})$ with finite variance $\sigma^{2}_{p}$. Let $\bm{X} \coloneq (x_{1},$ [...] where $\hat{\theta}_{\textrm{MAP}} \coloneq \bm{\Sigma} \bm{X}^{\top} \bm{y}$ is the maximum-a-posteriori [...]

Figures (6)

  • Figure 1: Test LPD plotted against inverse temperature $\beta$. We compare two types of PPD: SM-PD (green) as defined in \ref{eq:smpd} and TM-PD (blue) as defined in \ref{eq:tmpd}. Zoomed-in versions of these curves are also provided in \ref{fig:test-lpd} and \ref{fig:test-tempered-lpd}, respectively. In each example, we have five evaluations of $\hat{\beta}^{*}$ from our method. Each of these $\hat{\beta}^{*}$ has a corresponding test LPD computed with SM-PD (red circle) and TM-PD (red cross). Some of the red crosses in the CIFAR-10 examples are out of range. Solid lines and shaded areas represent the mean $\pm$ standard error across five repetitions. The vertical dotted lines indicate the PPD at $\beta = 1$. Higher test LPD is better.
  • Figure 2: Test LPD and accuracy of CIFAR-10 plotted against inverse temperature $\beta$ under various levels of data augmentation (color). The lines and shaded areas represent the mean $\pm$ standard error across five repetitions. There are five dots for each colored curve, and each of these dots corresponds to a repetition of $\hat{\beta}^{*}$ from our method. The vertical dotted lines indicate the PPD at $\beta = 1$. There is a subtle shift of peaks from left to right as the augmentation strength increases. Higher test LPD and accuracy indicate better performance.
  • Figure 3: Validation LPD plotted against inverse temperature $\beta$. This is computed with SM-PD as defined in \ref{eq:smpd}. Solid lines and shaded areas represent the mean $\pm$ standard error across five repetitions. The vertical dotted lines indicate the PPD at $\beta = 1$. There are five red dots in each plot, each corresponding to a repetition of $\hat{\beta}^{*}$ from our method. Higher LPD indicates better performance.
  • Figure 4: Test LPD plotted against inverse temperature $\beta$ with SGMCMC (green, solid). This is computed with SM-PD as defined in \ref{eq:smpd}. The SGD solution (horizontal, black, dashed) is included as a reference. The SGD reference in the MNIST and CIFAR-10 examples performs considerably worse and is out of range. Lines and shaded areas represent the mean $\pm$ standard error across five repetitions. The vertical dotted lines indicate the PPD at $\beta = 1$. There are five red dots in each plot, each corresponding to a repetition of $\hat{\beta}^{*}$ from our method. Higher LPD indicates better performance.
  • Figure 5: Test LPD plotted against inverse temperature $\beta$ with SGMCMC (blue, solid). This is computed with TM-PD as defined in \ref{eq:tmpd}. The SGD solution (horizontal, black, dashed) is included as a reference. Lines and shaded areas represent the mean $\pm$ standard error across five repetitions. The vertical dotted lines indicate the PPD at $\beta = 1$. There are five red dots in each plot, each corresponding to a repetition of $\hat{\beta}^{*}$ from our method. Higher LPD indicates better performance.
  • ...and 1 more figure
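
The captions above compare test LPD under two predictive densities, SM-PD and TM-PD. As a rough illustration for classification, the sketch below computes a Monte Carlo estimate of the test LPD from posterior samples of logits. Because the defining equations are not reproduced here, it rests on assumptions: SM-PD is taken to average the standard softmax likelihood over tempered-posterior samples, and TM-PD is taken to average the renormalized tempered likelihood, which for a softmax model equals softmax($\beta \cdot$ logits). The function name, array shapes, and toy inputs are hypothetical.

```python
import numpy as np

def log_softmax(z, axis=-1):
    """Numerically stable log-softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def log_predictive_density(logits, labels, beta=1.0):
    """Monte Carlo estimate of the mean test LPD.

    logits: array of shape (S, N, C), one set of logits per posterior sample.
    labels: integer array of shape (N,).
    beta:   1.0 gives the standard-model likelihood (assumed SM-PD);
            other values rescale the logits (assumed TM-PD).
    """
    _, n, _ = logits.shape
    logp = log_softmax(beta * logits)            # (S, N, C)
    logp_true = logp[:, np.arange(n), labels]    # (S, N): log p(y_n | x_n, theta_s)
    # log of the average likelihood over samples, computed stably per test point
    m = logp_true.max(axis=0)
    log_mean = m + np.log(np.exp(logp_true - m).mean(axis=0))
    return float(log_mean.mean())

# Toy usage with random logits standing in for network outputs (hypothetical).
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 20, 10))   # 8 posterior samples, 20 test points, 10 classes
labels = rng.integers(0, 10, size=20)
print("LPD at beta = 1:", log_predictive_density(logits, labels))
print("LPD at beta = 3:", log_predictive_density(logits, labels, beta=3.0))
```

Averaging likelihoods inside the logarithm, rather than averaging log-likelihoods, is what makes this a predictive density of the posterior mixture instead of an average per-sample score.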

Theorems & Definitions (5)

  • Lemma 3.1
  • Lemma B.1
  • proof
  • Lemma E.1
  • proof