On Cold Posteriors of Probabilistic Neural Networks: Understanding the Cold Posterior Effect and A New Way to Learn Cold Posteriors with Tight Generalization Guarantees

Yijie Zhang

TL;DR

By balancing the influence of observed data and prior regularization, temperature adjustments can address issues of underfitting or overfitting in Bayesian models, bringing improved predictive performance.

Abstract

Bayesian inference provides a principled probabilistic framework for quantifying uncertainty by updating beliefs based on prior knowledge and observed data through Bayes' theorem. In Bayesian deep learning, neural network weights are treated as random variables with prior distributions, allowing for a probabilistic interpretation and quantification of predictive uncertainty. However, Bayesian methods lack theoretical generalization guarantees for unseen data. PAC-Bayesian analysis addresses this limitation by offering a frequentist framework to derive generalization bounds for randomized predictors, thereby certifying the reliability of Bayesian methods in machine learning. Temperature $T$, or inverse-temperature $\lambda = \frac{1}{T}$, originally from statistical mechanics in physics, naturally arises in various areas of statistical inference, including Bayesian inference and PAC-Bayesian analysis. In Bayesian inference, when $T < 1$ ("cold" posteriors), the likelihood is up-weighted, resulting in a sharper posterior distribution. Conversely, when $T > 1$ ("warm" posteriors), the likelihood is down-weighted, leading to a more diffuse posterior distribution. By balancing the influence of observed data and prior regularization, temperature adjustments can address issues of underfitting or overfitting in Bayesian models, bringing improved predictive performance.
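To make the effect of the temperature concrete, here is a minimal sketch in a conjugate Beta-Bernoulli model, assuming the usual tempered posterior $p_\lambda(\theta \mid D) \propto p(D \mid \theta)^{\lambda}\, p(\theta)$ with $\lambda = 1/T$. The function names and the example counts are illustrative assumptions, not taken from the paper. With a Beta prior, tempering simply rescales the observed counts by $\lambda$, so a cold posterior ($\lambda > 1$) concentrates and a warm one ($\lambda < 1$) spreads out.

```python
def tempered_beta_posterior(a, b, successes, failures, lam):
    """Parameters of the tempered posterior for a Bernoulli likelihood.

    Assumes p_lam(theta | D) ∝ p(D | theta)^lam * p(theta) with a Beta(a, b)
    prior; the result is Beta(a + lam * successes, b + lam * failures).
    """
    return a + lam * successes, b + lam * failures


def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1.0))


if __name__ == "__main__":
    # Hypothetical data: 7 successes and 3 failures, uniform Beta(1, 1) prior.
    for temperature in (4.0, 1.0, 0.25):
        lam = 1.0 / temperature
        alpha, beta = tempered_beta_posterior(1.0, 1.0, 7, 3, lam)
        print(f"T={temperature:<5} lambda={lam:<5} "
              f"posterior variance={beta_variance(alpha, beta):.5f}")
```

Running the script prints the posterior variance for a warm, a standard, and a cold posterior; the variance shrinks as $T$ decreases.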

Paper Structure

This paper contains 92 sections, 12 theorems, 84 equations, 24 figures, 14 tables.

Key Result

Proposition 2.1

The derivative of the empirical Gibbs loss of the tempered posterior $p_\lambda$ satisfies where $\mathbb{V}(\cdot)$ denotes the variance.
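The following sketch shows why a variance arises in a statement of this kind. It rests on assumed definitions that this summary does not spell out: the tempered posterior $p_\lambda(\theta \mid D) \propto p(D \mid \theta)^{\lambda}\, p(\theta)$ and the empirical Gibbs loss $\hat{G}(p_\lambda, D) = \mathbb{E}_{p_\lambda}[-\ln p(D \mid \theta)]$. The paper may normalize the loss differently (for example by the sample size), which would change the constant but not the sign.

```latex
\begin{align*}
Z(\lambda) &= \int p(D \mid \theta)^{\lambda}\, p(\theta)\, d\theta,
\qquad
\frac{d}{d\lambda} \ln Z(\lambda) = \mathbb{E}_{p_\lambda}\big[\ln p(D \mid \theta)\big], \\
\frac{d}{d\lambda}\, \mathbb{E}_{p_\lambda}\big[\ln p(D \mid \theta)\big]
  &= \frac{d^2}{d\lambda^2} \ln Z(\lambda)
   = \mathbb{V}_{p_\lambda}\big(\ln p(D \mid \theta)\big), \\
\frac{d}{d\lambda}\, \hat{G}(p_\lambda, D)
  &= -\,\mathbb{V}_{p_\lambda}\big(\ln p(D \mid \theta)\big) \;\le\; 0 .
\end{align*}
```

Under these assumptions the training loss is non-increasing in $\lambda$, since $\ln Z(\lambda)$ is a cumulant generating function and its second derivative is a variance.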

Figures (24)

  • Figure 1: Illustration of the new likelihood $q({\boldsymbol{y}}|{\boldsymbol{x}},{\boldsymbol{\theta}},\lambda)$ and priors $q({\boldsymbol{\theta}}|{\boldsymbol{X}},\lambda)$. In the left and middle figures, the original likelihood is in the form of the Bernoulli distribution. The left figure demonstrates the transformation from $\theta({\boldsymbol{x}})$ to $\theta^\star({\boldsymbol{x}}, \lambda) := \frac{\theta({\boldsymbol{x}})^{\lambda}}{\theta({\boldsymbol{x}})^{\lambda} + (1-\theta({\boldsymbol{x}}))^\lambda}$. In the middle figure, we display a Beta-Binomial example, where the prior, initialized as a Beta distribution, is updated with a single Bernoulli-distributed sample. In the right figure, we display the new prior, initialized as an inverse-gamma prior and updated with a Gaussian likelihood with a single observation.
  • Figure 2: 1. The CPE occurs in Bayesian linear regression with exact inference. 2. Model misspecification can lead to overfitting and to a "warm" posterior effect (WPE). Every column displays a specific setting, as indicated in the caption. The first row shows exact Bayesian posterior predictive fits for three different values of the tempering parameter $\lambda$. The second row shows the Gibbs loss $\hat{G}(p_\lambda, D)$ (aka training loss) and the Bayes loss $B(p_\lambda)$ (aka testing loss) with respect to $\lambda$. The experimental details are given in Appendix \ref{app:sec:experiment}.
  • Figure 3: Extended results of Figure \ref{fig:blr} with more configurations of model misspecification.
  • Figure 4: Experimental illustrations for the arguments in Section \ref{sec:modelmisspec&CPE} using small CNN via SGLD on MNIST. We show similar results on Fashion-MNIST with small CNN and CIFAR-10(0) with ResNet-18 in Appendix \ref{app:sec:experiment-approx}. Figures \ref{fig:CPE:NarrowPriorSmall} to \ref{fig:CPE:StandardPriorSmall} illustrate the arguments in Section \ref{sec:modelmisspec&CPE}. Figure \ref{fig:CPE:StandardPriorSmall} uses the standard prior ($\sigma=1$) and the standard softmax ($\gamma=1$) for the likelihood without applying DA. Figure \ref{fig:CPE:NarrowPriorSmall} follows a similar setup except for using a narrow prior. Figure \ref{fig:CPE:SoftmaxSmall} uses a narrow prior as in Figure \ref{fig:CPE:NarrowPriorSmall} but with a tempered softmax that results in a lower aleatoric uncertainty. We report the training loss $\hat{G}(p_\lambda,D)$ and the testing losses, $B(p_\lambda)$ and $G(p_\lambda)$, from 10 samples of the small Convolutional neural network (CNN) via Stochastic Gradient Langevin Dynamics (SGLD). We show the mean and standard error across three different seeds. For additional experimental details, please refer to Appendix \ref{app:sec:experiment-approx}.
  • Figure 5: Experimental illustrations for the arguments in Section \ref{sec:modelmisspec&CPE} using large CNN via SGLD on MNIST. We show similar results on Fashion-MNIST with large CNN and CIFAR-10(0) with ResNet-50 in Appendix \ref{app:sec:experiment-approx}. The experiment setup is similar to the setups in Figure \ref{fig:CPE:small} but with a large CNN. Please refer to Appendix \ref{app:sec:experiment-approx} for further details on the model.
  • ...and 19 more figures
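
The transformation in the Figure 1 caption, $\theta^\star({\boldsymbol{x}}, \lambda) = \frac{\theta({\boldsymbol{x}})^{\lambda}}{\theta({\boldsymbol{x}})^{\lambda} + (1-\theta({\boldsymbol{x}}))^\lambda}$, can be tried out numerically. The snippet below is a minimal sketch; the function name and the sample values are illustrative assumptions. It shows that $\lambda > 1$ pushes a Bernoulli probability toward 0 or 1, $\lambda < 1$ pulls it toward 0.5, and $\lambda = 1$ leaves it unchanged.

```python
def tempered_bernoulli(theta, lam):
    """Renormalized Bernoulli parameter after raising the likelihood to lam.

    Implements theta^lam / (theta^lam + (1 - theta)^lam) for 0 < theta < 1.
    """
    num = theta ** lam
    return num / (num + (1.0 - theta) ** lam)


if __name__ == "__main__":
    theta = 0.8  # hypothetical predicted probability
    for lam in (0.5, 1.0, 2.0):
        print(f"lambda={lam}: theta*={tempered_bernoulli(theta, lam):.4f}")
```

A probability of exactly 0.5 is a fixed point for every $\lambda$, which matches the symmetry of the formula.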

Theorems & Definitions (18)

  • Definition 2.1
  • Proposition 2.1
  • Proposition 2.2
  • Theorem 2.3
  • Proposition 2.4
  • Proposition 2.5
  • Proposition 2.6
  • proof
  • Proposition 2.7
  • proof
  • ...and 8 more