Bayesian Hypernetworks

David Krueger, Chin-Wei Huang, Riashat Islam, Ryan Turner, Alexandre Lacoste, Aaron Courville

TL;DR

The paper introduces Bayesian hypernetworks (BHNs), a flexible approach to Bayesian deep learning that uses invertible hypernetworks to transform simple noise into rich, multimodal posterior samples over primary-network parameters. By employing invertible generative models and a weight-normalization-based parametrization, BHNs enable efficient sampling and tractable entropy estimation within variational inference, scaling to large networks. Empirical results across classification, active learning, anomaly detection, and adversarial robustness show BHNs can match or exceed strong baselines and yield more calibrated uncertainty. The work demonstrates that expressive, correlated posterior modeling improves safety and reliability in practical deep learning tasks.

Abstract

We study Bayesian hypernetworks: a framework for approximate Bayesian inference in neural networks. A Bayesian hypernetwork $h$ is a neural network which learns to transform a simple noise distribution, $p(\boldsymbol{\epsilon}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$, to a distribution $q(\theta) := q(h(\boldsymbol{\epsilon}))$ over the parameters $\theta$ of another neural network (the "primary network"). We train $q$ with variational inference, using an invertible $h$ to enable efficient estimation of the variational lower bound on the posterior $p(\theta \mid \mathcal{D})$ via sampling. In contrast to most methods for Bayesian deep learning, Bayesian hypernets can represent a complex multimodal approximate posterior with correlations between parameters, while enabling cheap iid sampling of $q(\theta)$. In practice, Bayesian hypernets can provide a better defense against adversarial examples than dropout, and also exhibit competitive performance on a suite of tasks which evaluate model uncertainty, including regularization, active learning, and anomaly detection.
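
To make the role of invertibility concrete, the sketch below shows how an invertible map lets the log-density of a sampled parameter vector be evaluated by the change-of-variables formula, which is what makes the entropy term of the variational bound estimable by sampling. It is a minimal illustration under strong assumptions: the hypernetwork is replaced by an elementwise affine map, which is invertible but far simpler than a learned flow, and the names `affine_hypernet`, `sample_with_log_q`, and `entropy_estimate` are hypothetical, not taken from the paper.

```python
import math
import random


def log_standard_normal(eps):
    """Log density of N(0, I) evaluated at eps (a list of floats)."""
    return sum(-0.5 * e * e - 0.5 * math.log(2.0 * math.pi) for e in eps)


def affine_hypernet(eps, mu, log_scale):
    """Toy invertible map h(eps) = mu + exp(log_scale) * eps (elementwise).

    ASSUMPTION: stands in for a learned invertible hypernetwork.
    Returns theta and log|det dh/d(eps)|, which equals sum(log_scale) here.
    """
    theta = [m + math.exp(s) * e for e, m, s in zip(eps, mu, log_scale)]
    log_det = sum(log_scale)
    return theta, log_det


def sample_with_log_q(mu, log_scale, rng):
    """Draw theta = h(eps) and evaluate log q(theta) by change of variables."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    theta, log_det = affine_hypernet(eps, mu, log_scale)
    # log q(theta) = log p(eps) - log|det dh/d(eps)|
    log_q = log_standard_normal(eps) - log_det
    return theta, log_q


def entropy_estimate(mu, log_scale, n_samples, rng):
    """Monte Carlo estimate of the entropy of q: -E[log q(theta)]."""
    total = 0.0
    for _ in range(n_samples):
        _, log_q = sample_with_log_q(mu, log_scale, rng)
        total += log_q
    return -total / n_samples
```

For this affine map the density is a diagonal Gaussian, so the sampled `log_q` can be checked against the closed-form Gaussian log-density. A more expressive invertible map would change only `affine_hypernet` and its log-determinant.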

Paper Structure

This paper contains 22 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Illustration of BHNs (second and third) and a traditional non-Bayesian DNN (first) on the toy problem from Blundell et al. (2015). In the second subplot, we place a prior on the scaling factor $g$ and infer the posterior distribution using a BHN, while in the third subplot the hypernet is used to generate the whole weight matrices of the primary net. Each shaded region represents half a standard deviation in the posterior on the predictive mean. The red crosses are 50 examples from the training dataset.
  • Figure 2: Learning the identity function with an overparametrized network: $\hat{y}=a\cdot b\cdot x$. This parametrization results in symmetries shown by the dashed red lines, and the Bayesian hypernetwork assigns significant probability mass to both modes of the posterior ($a = b = 1$ and $a = b = -1$).
  • Figure 3: Box plot of performance across 10 trials. Bayesian hypernets (BHNs) with inverse autoregressive flows (IAF) consistently outperform the other methods.
  • Figure 4: Active learning: Bayesian hypernets outperform other approaches after sufficient acquisitions when warm-starting (left), for both the random acquisition function (top) and the BALD acquisition function (bottom). Warm-starting improves stability for all methods, but hurts performance for other approaches, compared with randomly re-initializing parameters as in Gal et al. (2016) (right). We also note that the baseline model (no dropout) is competitive with MC dropout, and outperforms the dropout baseline used by Gal et al. (2016). These curves are the average of three experiments.
  • Figure 5: Adversary detection: The horizontal axis is the step size of the fast gradient sign (FGS) algorithm. While accuracy drops when more perturbation is added to the data (left), uncertainty measures also increase (first row). In particular, the BALD and Mean STD scores, which measure epistemic uncertainty, are strongly increasing for BHNs, but not for dropout. The second-row and third-row plots show results for adversary detection and error detection (respectively) in terms of the AUC of ROC ($y$-axis) with increasing perturbation along the $x$-axis. Gradient direction is estimated with one Monte Carlo sample of the weights/dropout mask.
  • ...and 2 more figures
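
The BALD and Mean STD scores named in the Figure 5 caption can be sketched from Monte Carlo predictions as follows. This is an illustrative sketch, not the paper's code: it assumes the commonly used definitions, with BALD as the entropy of the mean prediction minus the mean entropy of the individual predictions, and Mean STD as the per-class standard deviation across parameter samples averaged over classes. The function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def bald_score(probs):
    """BALD score from sampled predictions.

    probs: list of class-probability vectors, one per sampled parameter set.
    Returns H[mean prediction] - mean H[prediction].
    """
    n_samples = len(probs)
    n_classes = len(probs[0])
    mean_p = [sum(p[c] for p in probs) / n_samples for c in range(n_classes)]
    expected_entropy = sum(entropy(p) for p in probs) / n_samples
    return entropy(mean_p) - expected_entropy


def mean_std(probs):
    """Per-class standard deviation across samples, averaged over classes.

    ASSUMPTION: uses the population standard deviation (divide by n).
    """
    n_samples = len(probs)
    n_classes = len(probs[0])
    total = 0.0
    for c in range(n_classes):
        values = [p[c] for p in probs]
        mean = sum(values) / n_samples
        var = sum((v - mean) ** 2 for v in values) / n_samples
        total += math.sqrt(var)
    return total / n_classes
```

Both scores are zero when every sampled parameter set gives the same prediction and grow as the samples disagree, which is why they are read as measures of epistemic rather than aleatoric uncertainty.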