Fixed-Mean Gaussian Processes for Post-hoc Bayesian Deep Learning

Luis A. Ortega, Simón Rodríguez-Santana, Daniel Hernández-Lobato

TL;DR

The paper tackles post-hoc uncertainty estimation for pre-trained deep neural networks (DNNs) by fixing the posterior mean of a Gaussian process (GP) to the DNN output, which a universal kernel makes possible for any continuous function. It introduces Fixed-Mean Gaussian Processes (FMGP), a decoupled sparse variational GP framework that keeps the mean fixed while learning the predictive variances; this avoids computing DNN Jacobians and scales uncertainty estimation to large datasets. On synthetic problems, CIFAR10, ImageNet, and QM9, FMGP is reported to give well-calibrated uncertainty and competitive or superior predictive performance, with favorable training and inference times compared to state-of-the-art post-hoc methods. The approach is architecture-agnostic and scalable, which makes it broadly applicable, and it leaves room for kernel customization.

Abstract

Recently, there has been increasing interest in post-hoc uncertainty estimation for the predictions of pre-trained deep neural networks (DNNs). Given a DNN pre-trained via back-propagation, these methods enhance the original network by adding output confidence measures, such as error bars, without compromising its initial accuracy. In this context, we introduce a novel family of sparse variational Gaussian processes (GPs), where the posterior mean is fixed to any continuous function when using a universal kernel. Specifically, we fix the mean of this GP to the output of the pre-trained DNN, allowing our approach to effectively fit the GP's predictive variances to estimate the DNN prediction uncertainty. Our approach leverages variational inference (VI) for efficient stochastic optimization, with training costs that remain independent of the number of training points, scaling efficiently to large datasets such as ImageNet. The proposed method, called fixed mean GP (FMGP), is architecture-agnostic, relying solely on the pre-trained model's outputs to adjust the predictive variances. Experimental results demonstrate that FMGP improves both uncertainty estimation and computational efficiency when compared to state-of-the-art methods.

Paper Structure

This paper contains 21 sections, 2 theorems, 27 equations, 6 figures, and 2 tables.

Key Result

Proposition 3

A SVGP with $q(\mathbf{u}) = \mathcal{N}(\bm \mu, \bm S)$ has a dual representation in $\mathcal{Q}$ where $\bm a = K(\mathbf{Z}, \mathbf{Z})^{-1}\bm \mu$ and $\bm{A} = K(\mathbf{Z}, \mathbf{Z})^{-1}\bm{S}K(\mathbf{Z}, \mathbf{Z})^{-1} - K(\mathbf{Z}, \mathbf{Z})^{-1}$.
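The identity in Proposition 3 can be checked numerically. The sketch below is a minimal illustration, not the authors' code. It assumes the dual form in which the predictive mean is $K(\mathbf{x}, \mathbf{Z})\bm a$ and the predictive covariance is $K(\mathbf{x}, \mathbf{x}') + K(\mathbf{x}, \mathbf{Z})\bm{A}K(\mathbf{Z}, \mathbf{x}')$, and compares it with the standard SVGP predictive equations. The squared exponential kernel, the inducing inputs `Z`, the test inputs `X`, and the random `mu` and `S` are all illustrative choices that do not come from the paper.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    """Squared exponential kernel matrix between the rows of X1 and X2."""
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)

rng = np.random.default_rng(0)
M = 5                                     # number of inducing points (assumed)
Z = np.linspace(-2.0, 2.0, M)[:, None]    # inducing inputs (assumed)
X = rng.uniform(-3.0, 3.0, size=(7, 1))   # test inputs (assumed)

# Variational distribution q(u) = N(mu, S), with S symmetric positive definite.
mu = rng.normal(size=M)
L = np.tril(rng.normal(size=(M, M)))
S = L @ L.T + 1e-3 * np.eye(M)

Kzz = rbf(Z, Z)
Kxz = rbf(X, Z)
Kxx = rbf(X, X)
Kzz_inv = np.linalg.inv(Kzz)

# Standard SVGP predictive mean and covariance.
mean_svgp = Kxz @ Kzz_inv @ mu
cov_svgp = Kxx - Kxz @ Kzz_inv @ Kxz.T + Kxz @ Kzz_inv @ S @ Kzz_inv @ Kxz.T

# Dual parameters from Proposition 3.
a = Kzz_inv @ mu
A = Kzz_inv @ S @ Kzz_inv - Kzz_inv

# Predictive mean and covariance in the assumed dual form.
mean_dual = Kxz @ a
cov_dual = Kxx + Kxz @ A @ Kxz.T

print(np.allclose(mean_svgp, mean_dual), np.allclose(cov_svgp, cov_dual))
```

The explicit inverse keeps the correspondence with the formula visible; a practical implementation would more likely use Cholesky solves for numerical stability.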

Figures (6)

  • Figure 1: Predictive distribution (mean in black, $2\sigma$ shaded region) on a toy 1D regression dataset. The considered approaches include a $2$-hidden-layer MLP with $50$ units trained using back-propagation (MAP), linearized Laplace approximation (LLA), fixed-mean Gaussian process (FMGP) with squared exponential kernel, mean-field variational inference (MFVI) for DNN fine-tuning, Gaussian process (GP) with squared exponential kernel, and Hamiltonian Monte Carlo (HMC). All methods' hyper-parameters are optimized using training data except HMC, which uses uniform hyper-priors.
  • Figure 2: Representation of the considered sets of variational Gaussian measures for fixed-mean Gaussian processes.
  • Figure 3: Results obtained in regression problems for different post-hoc methods. Each group of three bars corresponds to the Year, Airline and Taxi datasets, from left to right. MAP uncertainty is obtained using Gaussian noise optimized on a validation set. We report average results across $5$ different repetitions using different random seeds. Error bars are shown but they are negligible in most cases.
  • Figure 4: Test results obtained in CIFAR10 for different pre-trained ResNet architectures. LLA employs a last-layer approximation. Out-of-distribution AUC is computed on a binary classification task discriminating between CIFAR10 and SVHN data instances, using the entropy of the predictive distribution as the score. We report average results across $5$ different repetitions using different random seeds. Error bars are shown but they are negligible in most cases.
  • Figure 5: Histograms of the entropy of the predictive distribution of each method. We plot histograms for each class label, across $5$ different repetitions using different random seeds. The class labels are CIFAR10 (in-distribution) and SVHN (out-of-distribution) instances. We consider the ResNet56 architecture. We also show the average AUC of each method.
  • ...and 1 more figure
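
The out-of-distribution AUC described in the captions of Figures 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's evaluation code. The logits are synthetic stand-ins for model outputs on CIFAR10 (in-distribution) and SVHN (out-of-distribution), the helper names `softmax`, `predictive_entropy` and `auc` are hypothetical, out-of-distribution instances are treated as the positive class, and the AUC is computed with the pairwise rank formula.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predictive_entropy(probs, eps=1e-12):
    """Entropy of each categorical predictive distribution (one per row)."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def auc(scores_neg, scores_pos):
    """Probability that a random positive outscores a random negative.

    Ties count as one half; this equals the area under the ROC curve.
    """
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Synthetic stand-ins (assumed): confident logits in-distribution,
# uninformative logits out-of-distribution.
rng = np.random.default_rng(0)
n, num_classes = 200, 10
labels = rng.integers(num_classes, size=n)
logits_in = rng.normal(size=(n, num_classes))
logits_in[np.arange(n), labels] += 5.0
logits_out = rng.normal(size=(n, num_classes))

# Entropy is the score; a higher entropy should indicate an OOD input.
h_in = predictive_entropy(softmax(logits_in))
h_out = predictive_entropy(softmax(logits_out))
ood_auc = auc(h_in, h_out)
print(f"OOD AUC: {ood_auc:.3f}")
```

An AUC near 1 means the entropy separates the two groups well, while 0.5 corresponds to chance.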

Theorems & Definitions (10)

  • Remark 1
  • Definition 2
  • Example 1
  • Proposition 3: See cheng2016incremental for further details
  • proof
  • Remark 4
  • Proposition 5
  • proof
  • Definition 6
  • Remark 7