Can a Confident Prior Replace a Cold Posterior?

Martin Marek, Brooks Paige, Pavel Izmailov

TL;DR

This work explores whether posterior tempering can be replaced by a confidence-inducing prior distribution. It introduces a DirClip prior that is practical to sample and nearly matches the performance of a cold posterior, and a confidence prior that directly approximates a cold likelihood in the limit of decreasing temperature but cannot be easily sampled.

Abstract

Benchmark datasets used for image classification tend to have very low levels of label noise. When Bayesian neural networks are trained on these datasets, they often underfit, misrepresenting the aleatoric uncertainty of the data. A common solution is to cool the posterior, which improves fit to the training data but is challenging to interpret from a Bayesian perspective. We explore whether posterior tempering can be replaced by a confidence-inducing prior distribution. First, we introduce a "DirClip" prior that is practical to sample and nearly matches the performance of a cold posterior. Second, we introduce a "confidence prior" that directly approximates a cold likelihood in the limit of decreasing temperature but cannot be easily sampled. Lastly, we provide several general insights into confidence-inducing priors, such as when they might diverge and how fine-tuning can mitigate numerical instability.
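
To make "cooling the posterior" concrete, the following is a minimal sketch in plain Python. It assumes the common convention that a cold posterior raises the whole unnormalized posterior (likelihood times prior) to the power $1/T$ with $T < 1$; the paper's exact notation is not shown in this excerpt, so treat this convention as an assumption. The function names, the toy logits, and the isotropic Normal prior over parameters are illustrative and not taken from the authors' code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of class logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def log_likelihood(logits_batch, labels):
    """Categorical log-likelihood summed over all examples."""
    return sum(log_softmax(z)[y] for z, y in zip(logits_batch, labels))

def log_normal_prior(params, scale):
    """Log-density of an isotropic N(0, scale^2) prior over parameters."""
    const = -math.log(scale) - 0.5 * math.log(2.0 * math.pi)
    return sum(const - 0.5 * (p / scale) ** 2 for p in params)

def tempered_log_posterior(logits_batch, labels, params, scale, temperature):
    """Unnormalized log-posterior divided by T (assumed tempering convention)."""
    log_post = log_likelihood(logits_batch, labels) + log_normal_prior(params, scale)
    return log_post / temperature

# Hypothetical toy values: two examples, two classes, two parameters.
toy_logits = [[2.0, 0.0], [0.0, 1.0]]
toy_labels = [0, 1]
toy_params = [0.05, -0.02]

standard = tempered_log_posterior(toy_logits, toy_labels, toy_params, 0.1, 1.0)
cold = tempered_log_posterior(toy_logits, toy_labels, toy_params, 0.1, 0.5)
```

Dividing by $T < 1$ rescales every log-density difference between parameter settings by $1/T$, so the cold posterior concentrates more sharply on parameters that fit the training data. A confidence-inducing prior would aim for a similar effect while keeping $T = 1$.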

Paper Structure

This paper contains 28 sections, 31 equations, and 20 figures.

Figures (20)

  • Figure 1: Decision boundaries of a Bayesian neural network using the DirClip prior. By varying the concentration parameter of the prior, we can control the model's aleatoric uncertainty, leading to different decision boundaries. The plotted decision boundaries were obtained using Hamiltonian Monte Carlo, using the dataset from Figure 1 of ndg.
  • Figure 2: Confidence of ResNet20 trained on CIFAR-10 with a Normal prior. The dashed line shows the average confidence of prior samples as a function of the prior scale (standard deviation). The relationship is one-to-one: the prior scale exactly determines prior confidence. In contrast, the prior scale has almost no effect on posterior confidence (each scatter point corresponds to a single trained model). Here, the intuition that "prior confidence translates into posterior confidence" fails. Instead, the posterior confidence depends mostly on the posterior temperature, visualized using the colorbar on the right.
  • Figure 3: Slices of various prior, likelihood, and posterior distributions. For each distribution, we assume that there are only two classes and we vary the predicted probability of the true class on the x-axis. Since the prior has no notion of the "true" class, it is symmetric. Note that the x-axis is non-linear to better show the tail behavior of each distribution. Notably, the NDG prior peaks at a very small (and large) value of predicted probability, which would not be visible on a linear scale. The blue and green stars in the left and right plots show local maxima.
  • Figure 4: Factorized NDG. This figure shows the accuracy of ResNet20 on CIFAR-10 with data augmentation for various BNN posteriors. Each posterior consists of a $\mathcal{N}(0, 0.1^2)$ prior over model parameters, a (modified) likelihood, and optionally an additional prior term over predictions. The model using the standard categorical ($T=1$) likelihood provides a simple baseline. The NDG posterior models defined over logits and log-probabilities both reach the same test accuracy, on par with a cold posterior. In contrast, the NDG prior and likelihood on their own do not match the performance of a cold posterior. Note that the training accuracy was evaluated on posterior samples, whereas the test accuracy was evaluated on the posterior ensemble.
  • Figure 5: DirClip accuracy. This figure shows the accuracy of ResNet20 on CIFAR-10 with data augmentation for various BNN posteriors. Each solid line corresponds to a different clipping value of the DirClip prior (printed in the legend). The blue line shows DirClip posteriors sampled from random initialization; all other DirClip posteriors are fine-tuned from a checkpoint with 100% training accuracy. Note that the training accuracy was evaluated on posterior samples, whereas the test accuracy was evaluated on the posterior ensemble.
  • ...and 15 more figures
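
The "average confidence" in the Figure 2 caption can be illustrated with a short sketch. It assumes confidence means the probability assigned to the predicted class (the maximum softmax probability), which is a common definition but is not spelled out in this excerpt. As a deliberate simplification, the prior scale is stood in for by a hypothetical multiplicative factor on the logits; a real ResNet20's logits do not depend on the prior scale this simply.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence(logits):
    """Probability assigned to the most likely class (assumed definition)."""
    return max(softmax(logits))

def average_confidence(logits_batch):
    """Mean confidence over a batch of predictions."""
    return sum(confidence(z) for z in logits_batch) / len(logits_batch)

def scale_logits(logits_batch, factor):
    """Hypothetical stand-in for a larger prior scale: multiply all logits."""
    return [[factor * z for z in logits] for logits in logits_batch]

# Hypothetical logits for three inputs and three classes.
toy_batch = [[0.3, -0.1, 0.0], [0.0, 0.2, -0.4], [-0.2, 0.1, 0.5]]

low_scale = average_confidence(toy_batch)
high_scale = average_confidence(scale_logits(toy_batch, 10.0))
```

In this toy setting, larger logits yield higher average confidence, which mirrors the monotone relationship between prior scale and prior confidence that the caption describes. The caption's main point is that this relationship does not carry over to the posterior.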