On Feature Collapse and Deep Kernel Learning for Single Forward Pass Uncertainty

Joost van Amersfoort, Lewis Smith, Andrew Jesson, Oscar Key, Yarin Gal

TL;DR

This work analyzes why Deep Kernel Learning can yield unreliable uncertainty due to feature collapse and introduces Deterministic Uncertainty Estimation (DUE), which constrains the deep feature extractor to be bi-Lipschitz via spectral normalization and residual connections. By pairing a bi-Lipschitz feature space with a sparse variational Gaussian process on a small set of inducing points, DUE preserves the non-parametric uncertainty properties of GPs while enabling single forward-pass predictions. Empirically, DUE outperforms previous single-pass uncertainty methods on CIFAR-10 vs SVHN and on a regression benchmark for personalized healthcare, while training end-to-end from scratch with modest inducing-point counts. The approach offers practical, scalable uncertainty estimation that is fast enough for real-time use, though it does not guarantee correct uncertainty in all cases; stronger theoretical guarantees and an assessment of societal impact are left to future work.

Abstract

Inducing point Gaussian process approximations are often considered a gold standard in uncertainty estimation since they retain many of the properties of the exact GP and scale to large datasets. A major drawback is that they have difficulty scaling to high dimensional inputs. Deep Kernel Learning (DKL) promises a solution: a deep feature extractor transforms the inputs over which an inducing point Gaussian process is defined. However, DKL has been shown to provide unreliable uncertainty estimates in practice. We study why, and show that with no constraints, the DKL objective pushes "far-away" data points to be mapped to the same features as those of training-set points. With this insight we propose to constrain DKL's feature extractor to approximately preserve distances through a bi-Lipschitz constraint, resulting in a feature space favorable to DKL. We obtain a model, DUE, which demonstrates uncertainty quality outperforming previous DKL and other single forward pass uncertainty methods, while maintaining the speed and accuracy of standard neural networks.
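The bi-Lipschitz constraint mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy residual block f(x) = x + relu(Wx) and a hypothetical coefficient c < 1, and all function names are chosen for illustration. If the spectral norm of W is at most c, then (1 - c)·||x - y|| ≤ ||f(x) - f(y)|| ≤ (1 + c)·||x - y||, so the block can neither collapse distinct inputs onto each other nor stretch distances arbitrarily.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v * v for v in x))


def spectral_norm(W, n_iter=200, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W.

    Power iteration approaches the true value from below, so this is an
    estimate rather than a strict upper bound.
    """
    rng = random.Random(seed)
    Wt = [list(col) for col in zip(*W)]
    v = [rng.gauss(0.0, 1.0) for _ in W[0]]
    for _ in range(n_iter):
        v = matvec(Wt, matvec(W, v))
        n = norm(v)
        if n == 0.0:  # W maps everything to zero
            return 0.0
        v = [vi / n for vi in v]
    return norm(matvec(W, v))


def spectral_normalize(W, c=0.9):
    """Rescale W so that its estimated spectral norm is at most c."""
    sigma = spectral_norm(W)
    scale = 1.0 if sigma <= c else c / sigma
    return [[w * scale for w in row] for row in W]


def residual_block(W, x):
    """Toy residual block f(x) = x + relu(W x); relu is 1-Lipschitz."""
    g = [max(0.0, h) for h in matvec(W, x)]
    return [xi + gi for xi, gi in zip(x, g)]


def distance_ratio(W, x, y):
    """Ratio ||f(x) - f(y)|| / ||x - y|| for the residual block."""
    fx, fy = residual_block(W, x), residual_block(W, y)
    d_out = norm([a - b for a, b in zip(fx, fy)])
    d_in = norm([a - b for a, b in zip(x, y)])
    return d_out / d_in


if __name__ == "__main__":
    rng = random.Random(1)
    W = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(4)]
    Wn = spectral_normalize(W, c=0.9)
    x = [rng.gauss(0.0, 1.0) for _ in range(4)]
    y = [rng.gauss(0.0, 1.0) for _ in range(4)]
    # With c = 0.9 the ratio should lie between roughly 0.1 and 1.9.
    print(distance_ratio(Wn, x, y))
```

The residual connection supplies the lower bound (the identity path keeps distinct inputs apart), while the normalized weight supplies the upper bound. In a real network the normalization would be applied to every layer and recomputed during training.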

Paper Structure

This paper contains 22 sections, 2 theorems, 11 equations, 7 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

The marginal likelihood of a GP with a neural feature extractor, i.e. with a kernel function $k(f_\theta(\cdot), f_\theta(\cdot))$ (DKL) where $f$ is a deep neural network parameterised by $\theta$, can be made arbitrarily large if the feature extractor $f_\theta$ is allowed to map data points arbit

Figures (7)

  • Figure 1: Green: 300 example training data points. Blue: the prediction with its uncertainty (one and two standard deviations). DUE performs well when trained with either 1 thousand (1K) or 1 million (1M) data points. In contrast, the RFF approximation in SNGP concentrates its uncertainty at 1M and is very uncertain at 1K. This highlights a drawback of the parametric RFF approximation.
  • Figure 2: A 2D classification task where the classes are two Gaussian blobs (drawn in green), and a grid of unrelated points (colored according to their log-probability under the data generating distribution). We additionally mark a specific point with a star. In (b), the features as computed by an unconstrained model. In (c), the features computed by a model with residual connections and spectral normalization. The objective for the unconstrained model introduces a large amount of distortion of the space, collapsing the input to a single line, making it almost impossible to use distance-sensitive measures on these features. In particular, the star moves from an unrelated area in input space on top of class data in feature space. In contrast, the constrained mapping maintains the relative distances of the other points.
  • Figure 3: We show uncertainty results on the two moons dataset. Yellow indicates high confidence, while blue indicates uncertainty. In Figure \ref{fig:twomoons_softmax}, a simple feed-forward ResNet with a softmax output is certain everywhere except on the decision boundary. In Figure \ref{fig:twomoons_ffn}, we see that GPDNN, which uses a simple Feed-Forward Network as feature extractor, is certain even far away from the training data. In Figure \ref{fig:twomoons_DUE}, we show DUE, which has the appropriate restrictions on the feature extractor (residual connections and spectral normalization) and obtains close to ideal uncertainty on this dataset.
  • Figure 4: Predicted CATE versus true CATE with 95% confidence intervals for a randomly chosen cross-validation run. DUE is confident (without converging to no uncertainty) and correct, while BTARNet (Shalit et al., 2017) is wrong in $2$ instances, where the true CATE falls outside the confidence interval.
  • Figure 5: A density of the Lipschitz values in batch normalization layers, averaged across 15 WRN models that were trained with Softmax output and without spectral normalization (exactly following Zagoruyko and Komodakis, 2016). We see that many of the constants are significantly above 1, highlighting that batch normalization has a significant impact on the Lipschitz constant of the network.
  • ...and 2 more figures
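
The Lipschitz values discussed in the Figure 5 caption can be made concrete with a small sketch. It assumes batch normalization in inference mode with fixed running statistics, where each channel applies the affine map x -> gamma * (x - mean) / sqrt(var + eps) + beta, so the layer's Lipschitz constant is the largest absolute per-channel scale. The function name and the example values are hypothetical and are not taken from the paper's trained models.

```python
import math


def batchnorm_lipschitz(gamma, running_var, eps=1e-5):
    """Lipschitz constant of an inference-mode batch norm layer.

    The layer is a per-channel (diagonal) affine map, so its Lipschitz
    constant is the maximum of |gamma_c| / sqrt(var_c + eps) over channels.
    """
    return max(abs(g) / math.sqrt(v + eps) for g, v in zip(gamma, running_var))


if __name__ == "__main__":
    # Hypothetical learned scales and running variances for three channels.
    gamma = [1.0, 0.5, 2.0]
    running_var = [1.0, 0.25, 0.04]
    # The third channel has a small variance and a large scale, so it
    # dominates and the layer can stretch distances well beyond a factor of 1.
    print(batchnorm_lipschitz(gamma, running_var))
```

A channel with small running variance and a large learned scale is enough to push the constant far above 1, which is why constraining only the convolutional or linear weights does not bound the Lipschitz constant of a network that also contains batch normalization.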

Theorems & Definitions (4)

  • Proposition 1
  • proof : Informal Proof of Proposition \ref{prop:feature_collapse}
  • Lemma 1
  • proof : Proof of Lemma \ref{lem:ober}