Can Bayesian Neural Networks Make Confident Predictions?

Katharine Fisher, Youssef Marzouk

TL;DR

This work analyzes Bayesian neural networks with a discrete prior on interior-layer weights and a Gaussian prior on final-layer weights, deriving that the posterior predictive is a $J$-component Gaussian mixture. It demonstrates that different interior-parameter realizations can map to distinct predictive modes, leading to multimodality in the predictive distribution even as data and network size scale. The study reveals that, under overparameterization, the posterior predictive may fail to contract with increasing data, challenging the assumption that full Bayesian posteriors always provide confident predictions. These findings have implications for the interpretation of Bayesian uncertainty in large neural networks and for designing predictive distributions that balance accuracy and calibration.

Abstract

Bayesian inference promises a framework for principled uncertainty quantification of neural network predictions. Barriers to adoption include the difficulty of fully characterizing posterior distributions on network parameters and the interpretability of posterior predictive distributions. We demonstrate that under a discretized prior for the inner layer weights, we can exactly characterize the posterior predictive distribution as a Gaussian mixture. This setting allows us to define equivalence classes of network parameter values which produce the same likelihood (training error) and to relate the elements of these classes to the network's scaling regime -- defined via ratios of the training sample size, the size of each layer, and the number of final layer parameters. Of particular interest are distinct parameter realizations that map to low training error and yet correspond to distinct modes in the posterior predictive distribution. We identify settings that exhibit such predictive multimodality, and thus provide insight into the accuracy of unimodal posterior approximations. We also characterize the capacity of a model to "learn from data" by evaluating contraction of the posterior predictive in different scaling regimes.
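The Gaussian-mixture structure described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a single hidden layer with a `tanh` activation, a finite list of candidate inner weight matrices `W_candidates` with prior probabilities `prior_probs`, a zero-mean Gaussian prior with variance `tau2` on the final-layer weights, and Gaussian observation noise with variance `gamma2`. Under these assumptions, each candidate yields a conjugate Bayesian linear regression on its features, so it contributes one Gaussian component, weighted by its prior probability times its marginal likelihood.

```python
import numpy as np

def mixture_predictive(X, y, x_test, W_candidates, prior_probs, tau2=1.0, gamma2=0.01):
    """Weights, means, and variances of a Gaussian-mixture posterior predictive.

    Hypothetical setup: features are tanh(X @ W) for each candidate W, the
    final-layer weights have prior N(0, tau2 * I), and the noise is N(0, gamma2).
    """
    n = X.shape[0]
    log_w, means, variances = [], [], []
    for W, pi in zip(W_candidates, prior_probs):
        Phi = np.tanh(X @ W)          # (n, p) training features for this candidate
        phi_t = np.tanh(x_test @ W)   # (p,) test features for this candidate
        p = Phi.shape[1]

        # Conjugate posterior over final-layer weights: N(mu, Sigma)
        Sigma = np.linalg.inv(Phi.T @ Phi / gamma2 + np.eye(p) / tau2)
        mu = Sigma @ Phi.T @ y / gamma2

        # Predictive component at the test point
        means.append(phi_t @ mu)
        variances.append(phi_t @ Sigma @ phi_t + gamma2)

        # Log marginal likelihood: y ~ N(0, tau2 * Phi Phi^T + gamma2 * I)
        C = tau2 * Phi @ Phi.T + gamma2 * np.eye(n)
        _, logdet = np.linalg.slogdet(C)
        log_ml = -0.5 * (y @ np.linalg.solve(C, y) + logdet + n * np.log(2 * np.pi))
        log_w.append(np.log(pi) + log_ml)

    # Normalize the mixture weights in a numerically stable way
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return w, np.array(means), np.array(variances)

# Usage on synthetic data (all sizes are arbitrary choices)
rng = np.random.default_rng(0)
n, d, p, J = 20, 5, 8, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
x_test = rng.normal(size=d)
W_candidates = [rng.normal(size=(d, p)) / np.sqrt(d) for _ in range(J)]
weights, means, variances = mixture_predictive(
    X, y, x_test, W_candidates, np.full(J, 1.0 / J)
)
print(weights, means, variances)
```

In this sketch, predictive multimodality would show up as two or more components with non-negligible `weights` and well-separated `means`. Comparing the component variances as `n` grows gives a rough view of whether the predictive contracts.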

Paper Structure (19 sections, 21 equations, 16 figures)

Figures (16)

  • Figure 1: Left and center: posterior predictive distributions for input dimension $d=100$ at select training set sizes $n$ and final layer widths $p$, as indicated by each title. The black line shows the pdf which is a mixture of Gaussians. Each shaded distribution is a component of this mixture with transparency corresponding to its weight. Right: Heatmaps depicting the log of the number of component distributions which have weight larger than $10^{-6}$ for specified network dimensions. Observation noise variance is set to $\gamma^2=0.01$ for these results.
  • Figure 2: Top left and bottom: Predictive distributions based on candidate parameters constructed to achieve \ref{eq:conjecture}. The full distribution is plotted in black and components are shaded according to their weight in indigo. We consider $10$ rotations, $10$ preimage samples, and $10$ column space samples to construct the distribution, for a total of $1000$ samples. Top right: The scale of predictive distributions for select $d$ and $p/n$ where $n/d=0.7$. We plot the mean and standard error obtained from $10$ realizations of $Y$ for which we find the median predictive variance across $10$ realizations of $\widetilde{x}_1$.
  • Figure 3: Posterior predictive distributions at test point $\widetilde{x}_1^{(2)}$ for input dimension $d=100$ at select training set sizes $n$ and final layer widths $p$, as indicated by each title. The black line shows the pdf which is a mixture of Gaussians. Each shaded distribution is a component of this mixture with transparency corresponding to its weight.
  • Figure 4: Posterior predictive distributions at test point $\widetilde{x}_1^{(3)}$ for input dimension $d=100$ at select training set sizes $n$ and final layer widths $p$, as indicated by each title. The black line shows the pdf which is a mixture of Gaussians. Each shaded distribution is a component of this mixture with transparency corresponding to its weight.
  • Figure 5: Posterior predictive distributions at test point $\widetilde{x}_1^{(2)}$ for input dimension $d=1000$ at select training set sizes $n$ and final layer widths $p$, as indicated by each title. The black line shows the pdf which is a mixture of Gaussians. Each shaded distribution is a component of this mixture with transparency corresponding to its weight.
  • ...and 11 more figures