Variational Inference on the Final-Layer Output of Neural Networks

Yadi Wei, Roni Khardon

TL;DR

This paper proposes VIFO, which combines the simple training of standard neural networks with the uncertainty quantification of Bayesian neural networks by performing Variational Inference in the Final layer Output space, which is much smaller than the parameter space.

Abstract

Traditional neural networks are simple to train, but they typically produce overconfident predictions. In contrast, Bayesian neural networks provide good uncertainty quantification, but optimizing them is time-consuming because of the large parameter space. This paper proposes to combine the advantages of both approaches by performing Variational Inference in the Final layer Output space (VIFO), because the output space is much smaller than the parameter space. We use neural networks to learn the mean and the variance of the probabilistic output. Using the Bayesian formulation, we incorporate collapsed variational inference into VIFO, which significantly improves performance in practice. At the same time, like standard non-Bayesian models, VIFO enjoys simple training, and one can use Rademacher complexity to provide risk bounds for the model. Experiments show that VIFO provides a good tradeoff between run time and uncertainty quantification, especially for out-of-distribution data.
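The idea of placing the variational distribution on the final-layer output can be illustrated with a small sketch. This is not the authors' implementation: it assumes a classification setting with a softmax likelihood, a diagonal Gaussian $q(z|x)$ whose mean `mu` and log-variance `log_var` would come from two network heads, and a standard normal prior over $z$. The function names and the prior are assumptions chosen for illustration only.

```python
import numpy as np

def softmax_log_likelihood(z, y):
    """Log p(y | z) for integer class labels y under a softmax over logits z."""
    z = z - z.max(axis=-1, keepdims=True)  # shift logits for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return np.take_along_axis(log_probs, y[..., None], axis=-1)[..., 0]

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over output dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def output_space_elbo(mu, log_var, y, n_samples=32, rng=None):
    """Monte Carlo ELBO per example, with q(z|x) defined on the output z.

    mu, log_var: arrays of shape (N, K), assumed to be produced by a network.
    y: integer labels of shape (N,).
    The N(0, I) prior over z is an assumption, not taken from the paper.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    eps = rng.standard_normal((n_samples,) + mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps            # reparameterized samples, (S, N, K)
    y_rep = np.broadcast_to(y, (n_samples,) + y.shape)
    expected_ll = softmax_log_likelihood(z, y_rep).mean(axis=0)  # (N,)
    kl = kl_diag_gaussian_to_standard_normal(mu, log_var)        # (N,)
    return expected_ll - kl
```

Because sampling happens over the $K$ outputs rather than over all network weights, each Monte Carlo sample costs only a softmax evaluation, which is one way to read the abstract's claim that the output space is much cheaper to work in than the parameter space.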

Paper Structure (48 sections, 5 theorems, 49 equations, 15 figures, 39 tables)


Key Result

Theorem 3.1

Let $q(z|x)=\mathcal{N}(z|w^\top x, x^\top V x)$ be the variational predictive distribution of VIFO, where $w$ and $V$ are the parameters to be optimized, and let $p(z|X_N)=\mathcal{N}(z|m_0^\top X_N, X_N^\top S_0 X_N)$ and $q(z|X_N)=\mathcal{N}(z|w^\top X_N, X_N^\top V X_N)$ be a correlated and dat
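The statement above is cut off in this listing, so only its first definition can be illustrated. As a minimal sketch, assuming a single input vector $x$ and a positive semidefinite $V$, the variational predictive $q(z|x)=\mathcal{N}(z|w^\top x, x^\top V x)$ has the following mean and variance. Building `V` from a lower-triangular factor is an assumption made here to guarantee positive semidefiniteness; the numeric values are arbitrary.

```python
import numpy as np

def linear_predictive(x, w, V):
    """Mean w^T x and variance x^T V x of q(z | x) for one input vector x."""
    mean = w @ x
    var = x @ V @ x
    return mean, var

x = np.array([1.0, 2.0])
w = np.array([0.5, -1.0])
L = np.array([[1.0, 0.0],
              [0.5, 2.0]])   # hypothetical lower-triangular factor
V = L @ L.T                  # positive semidefinite by construction

mean, var = linear_predictive(x, w, V)
```

With this parameterization the variance equals $\lVert L^\top x \rVert^2$, so it is nonnegative for every input.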

Figures (15)

  • Figure 1: Predictive distribution of VIFO using an MLP. Blue points are training data generated from a sinusoidal function, red points are the predicted mean, and the shaded area indicates one standard deviation. More details are in Appendix \ref{sec:detail-artificial}.
  • Figure 2: Predictions induced by the learned prior distribution for different methods. Note that VI has a prior over weights and VIFO has a prior over $z$. For each method we sample values from the prior, calculate predictions $y$ based on the sampled values, and plot the $y$ values. VI-naive induces a uniform prior that does not capture the data distribution, VI-mean has increased variance in areas where data is missing, and VIFO-mean does so to a larger extent. Details are given in Appendix \ref{sec:detail-artificial}.
  • Figure 3: Test log loss ($\downarrow$) on PreResNet20. Dashed lines indicate the best version of VIFO. Error bars show three times the standard deviation for better visualization; the same holds for the other figures.
  • Figure 4: ECE ($\downarrow$) on AlexNet and PreResNet20 under data shift. Dashed line indicates the best performance of VIFO. Numerical results are listed in the Appendix.
  • Figure 5: Entropy ($\uparrow$) on PreResNet20.
  • ...and 10 more figures

Theorems & Definitions (11)

  • Remark 2.1
  • Remark 2.2
  • Theorem 3.1
  • Theorem 3.2
  • Lemma 4.3
  • Corollary 4.4
  • Theorem 4.5
  • Proof of Theorem \ref{thm:linear}
  • Proof of Theorem \ref{thm:not-recover-vi}
  • Proof of Lemma \ref{lemma:rademacher-bivariate}
  • ...and 1 more