Towards Improved Variational Inference for Deep Bayesian Models

Sebastian W. Ober

TL;DR

This thesis investigates how to improve variational inference for deep Bayesian models by addressing two limitations: partial Bayesian treatments and posterior symmetries. It introduces a unified, correlation-aware variational posterior that links Bayesian neural networks and deep Gaussian processes via global inducing points, and then reframes deep models in Gram-matrix space with deep Wishart processes to remove rotational symmetries and improve ELBOs. The work reports gains in calibration, uncertainty quantification, and predictive performance on benchmarks such as CIFAR-10, while also identifying settings (e.g., models with many hyperparameters) where standard marginal-likelihood-based training can overfit. Collectively, these contributions advance practical, scalable Bayesian deep learning with improved uncertainty estimates, potential for model selection, and a principled path toward fully Bayesian deep kernel methods. The results highlight both the promise and the remaining challenges of using VI for robust deep probabilistic modeling in vision and beyond.

Abstract

Deep learning has revolutionized the last decade, being at the forefront of extraordinary advances in a wide range of tasks including computer vision, natural language processing, and reinforcement learning, to name but a few. However, it is well-known that deep models trained via maximum likelihood estimation tend to be overconfident and give poorly-calibrated predictions. Bayesian deep learning attempts to address this by placing priors on the model parameters, which are then combined with a likelihood to perform posterior inference. Unfortunately, for deep models, the true posterior is intractable, forcing the user to resort to approximations. In this thesis, we explore the use of variational inference (VI) as an approximation, as it is unique in simultaneously approximating the posterior and providing a lower bound to the marginal likelihood. If tight enough, this lower bound can be used to optimize hyperparameters and to facilitate model selection. However, this capacity has rarely been used to its full extent for Bayesian neural networks, likely because the approximate posteriors typically used in practice can lack the flexibility to effectively bound the marginal likelihood. We therefore explore three aspects of Bayesian learning for deep models: 1) we ask whether it is necessary to perform inference over as many parameters as possible, or whether it is reasonable to treat many of them as optimizable hyperparameters; 2) we propose a variational posterior that provides a unified view of inference in Bayesian neural networks and deep Gaussian processes; 3) we demonstrate how VI can be improved in certain deep Gaussian process models by analytically removing symmetries from the posterior, and performing inference on Gram matrices instead of features. We hope that our contributions will provide a stepping stone to fully realize the promises of VI in the future.
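
As a reminder of the bound the abstract refers to, the following is the standard identity behind VI. The notation is assumed here rather than taken from the thesis: $\mathbf{w}$ denotes the parameters being inferred, $\mathcal{D}$ the data, $\theta$ the hyperparameters, and $q(\mathbf{w})$ the approximate posterior.

$$
\log p(\mathcal{D} \mid \theta)
= \underbrace{\mathbb{E}_{q(\mathbf{w})}\left[\log p(\mathcal{D} \mid \mathbf{w}, \theta)\right]
- \mathrm{KL}\left(q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \theta)\right)}_{\mathrm{ELBO}(q,\, \theta)}
+ \mathrm{KL}\left(q(\mathbf{w}) \,\|\, p(\mathbf{w} \mid \mathcal{D}, \theta)\right)
\;\geq\; \mathrm{ELBO}(q, \theta)
$$

Because the final KL term is non-negative, the ELBO lower-bounds the log marginal likelihood, and the gap vanishes exactly when $q(\mathbf{w})$ equals the true posterior. This is why a more flexible approximate posterior can make the bound more useful for hyperparameter optimization and model selection.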

Paper Structure (154 sections, 2 theorems, 239 equations, 29 figures, 27 tables, 3 algorithms)

Introduction
What do we want from a model?
Probabilistic modeling
Bayesian modeling
Gaussian processes
A refresher on variational inference
Variational inference for Gaussian processes
Thesis overview
Background: Deep Bayesian modeling
Neural networks
Bayesian neural networks
Approximate inference in BNNs
Tempered posteriors and the cold posterior effect
Deep Gaussian processes
Symmetries in deep models
...and 139 more sections

Key Result

Proposition 1

Consider the GP regression model as described in Eq. eq:dkl:gp-regression. Then, for any valid kernel function that can be written in the form $k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \hat{k}(\mathbf{x}, \mathbf{x}')$, where $\sigma_f^2$ is a learnable hyperparameter along with learnable noise $\sig

Figures (29)

Figure 1: Maximum likelihood fits for three models with different numbers $W$ of "Gaussian bump" features. The simplest model (a) cannot effectively model the data, whereas the most complicated model (c) overfits to noise. The best model is therefore a model with intermediate complexity (b).
Figure 2: Plots of the posterior predictives and posterior samples (gray) for the three models, along with their log marginal likelihoods (LMLs). For the posterior predictives, we plot the mean functions (blue line), with the shaded regions corresponding to one and two standard deviations. The intermediate model (b) has the best LML, as the model with the fewest features (a) cannot effectively model the data, and the model with the most features (c) is penalized for having too much complexity.
Figure 3: Plots of the posterior predictives and samples for GP models with squared exponential kernels, trained on data subsampled from the toy example given in snelson2006sparse, along with their log marginal likelihoods. The model in (a) has been left at its initial hyperparameter values, whereas the model in (b) is the result of learning the hyperparameters according to the log marginal likelihood.
Figure 4: Demonstration of predictive posteriors for VI for a BLR model ($W=12$) with two approximate posteriors: (a) a full-covariance Gaussian approximate posterior, and (b) a mean-field Gaussian approximate posterior. The full-covariance posterior is able to recover the true posterior as well as the true prior standard deviation $\alpha=0.155$, whereas the mean-field approximate posterior cannot, with a worse ELBO and biased $\alpha=0.092$.
Figure 5: Illustration of sparse variational inference for GPs with a squared exponential kernel. The approximate posterior formed by taking only 2 inducing points (a) neither models the data well, nor provides sensible hyperparameters: for instance, the lengthscale is chosen as $\ell_1 = 1.100$, as compared to $\ell_1 = 0.415$ given by the LML (cf. Fig. \ref{fig:intro:exact_gpr}). By contrast, 10 inducing points (b) provide a better model of the data, obtain a much better ELBO (cf. LML from Fig. \ref{fig:intro:exact_gpr} of -14.21), and a much better lengthscale of $\ell_1 = 0.526$.
...and 24 more figures
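
The captions of Figures 1 and 2 compare models with different numbers $W$ of "Gaussian bump" features by their log marginal likelihoods (LMLs). A minimal sketch of how such an LML could be computed for Bayesian linear regression is given below. It is not the thesis's code: the feature map `bump_features`, the bump width, the prior and noise standard deviations, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def bump_features(x, num_features, width=0.1):
    # Hypothetical feature map: Gaussian bumps with centers evenly spaced on [0, 1].
    centers = np.linspace(0.0, 1.0, num_features)
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

def log_marginal_likelihood(x, y, num_features, prior_std=1.0, noise_std=0.1):
    # Model (assumed): y = Phi w + eps, w ~ N(0, prior_std^2 I), eps ~ N(0, noise_std^2 I).
    # Integrating out w gives y ~ N(0, prior_std^2 Phi Phi^T + noise_std^2 I).
    phi = bump_features(x, num_features)
    n = x.shape[0]
    cov = prior_std**2 * phi @ phi.T + noise_std**2 * np.eye(n)
    chol = np.linalg.cholesky(cov)           # cov = L L^T
    whitened = np.linalg.solve(chol, y)      # L^{-1} y
    quad = whitened @ whitened               # y^T cov^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (quad + log_det + n * np.log(2.0 * np.pi))

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(30)

# LML for a small, an intermediate, and a large number of features W.
lmls = {w: log_marginal_likelihood(x, y, w) for w in (2, 6, 30)}
for w, lml in lmls.items():
    print(f"W={w:2d}  LML={lml:.2f}")
```

Comparing the printed values across $W$ mirrors the model-comparison idea in Figure 2; which $W$ scores best depends on the data and on the assumed prior and noise scales.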

Theorems & Definitions (14)

Remark 1
Remark 2
Proposition 1
proof
Remark 3
Remark 4
Remark 5
Definition 1: The Wishart distribution; srivastava2003singular
Proposition 2: see e.g., Thm. 7.4 in polyanskiy2022information
Definition 2: The generalized singular Wishart distribution
...and 4 more
