Vecchia Gaussian Process Ensembles on Internal Representations of Deep Neural Networks

Felix Jimenez; Matthias Katzfuss

TL;DR

The paper introduces the deep Vecchia ensemble (DVE), a deterministic uncertainty quantification (UQ) framework that uses multiple intermediate representations of a pretrained neural network. DVE builds a Vecchia Gaussian process on the embeddings from each selected layer and fuses the resulting predictions via a product-of-experts. This yields scalable UQ without retraining the network and allows aleatoric and epistemic uncertainty to be distinguished. The method demonstrates competitive RMSE and improved NLL on UCI regression tasks and chemical-property prediction, while providing interpretable conditioning sets that reveal which training points influence a test prediction. DVE remains usable in the presence of feature collapse and provides uncertainty estimates for pretrained models, with potential applications in latent-space optimization and robust decision-making. Limitations include reliance on access to the training data and a Gaussian-likelihood assumption, suggesting future work on non-Gaussian likelihoods and integration with Bayesian weight models.

Abstract

For regression tasks, standard Gaussian processes (GPs) provide natural uncertainty quantification (UQ), while deep neural networks (DNNs) excel at representation learning. Deterministic UQ methods for neural networks have successfully combined the two and require only a single pass through the neural network. However, current methods necessitate changes to network training to address feature collapse, where unique inputs map to identical feature vectors. We propose an alternative solution, the deep Vecchia ensemble (DVE), which allows deterministic UQ to work in the presence of feature collapse, negating the need for network retraining. DVE comprises an ensemble of GPs built on hidden-layer outputs of a DNN, achieving scalability via Vecchia approximations that leverage nearest-neighbor conditional independence. DVE is compatible with pretrained networks and incurs low computational overhead. We demonstrate DVE's utility on several datasets and carry out experiments to understand the inner workings of the proposed method.
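The nearest-neighbor conditioning that the abstract attributes to Vecchia approximations can be illustrated on the features of a single layer. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it uses a zero-mean GP with a squared-exponential kernel and fixed hyperparameters, and it shows only the prediction step at one test point rather than the full Vecchia factorization. The names `rbf_kernel`, `nn_gp_predict`, `m`, and `noise` are hypothetical, and the random features stand in for hidden-layer outputs.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of a and b (assumed kernel)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def nn_gp_predict(feats_train, y_train, feat_test, m=10, noise=1e-2):
    """Zero-mean GP prediction at one test feature vector, conditioning
    only on its m nearest training features (Euclidean distance)."""
    dists = np.linalg.norm(feats_train - feat_test, axis=1)
    idx = np.argsort(dists)[:m]              # conditioning set for this test point
    f_c, y_c = feats_train[idx], y_train[idx]

    k_cc = rbf_kernel(f_c, f_c) + noise * np.eye(len(idx))
    k_tc = rbf_kernel(feat_test[None, :], f_c).ravel()
    k_tt = rbf_kernel(feat_test[None, :], feat_test[None, :])[0, 0]

    mean = k_tc @ np.linalg.solve(k_cc, y_c)
    # Predictive variance of the noisy response at the test point.
    var = k_tt - k_tc @ np.linalg.solve(k_cc, k_tc) + noise
    return mean, var, idx

# Synthetic stand-in for one layer's feature map (assumption, not real data).
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 3))
y = np.sin(feats[:, 0]) + 0.1 * rng.normal(size=200)

mean, var, idx = nn_gp_predict(feats, y, feats[0] + 0.01, m=10)
```

Because each prediction solves only an `m` by `m` linear system, the cost per test point does not grow with the number of training points beyond the neighbor search, which is the source of the scalability described above.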

Paper Structure (53 sections, 1 theorem, 7 equations, 9 figures, 7 tables)

INTRODUCTION
PRELIMINARIES
Modeling Functions with Gaussian Processes
Approximating Gaussian Processes Using Vecchia
Ensembling Gaussian Processes
Extracting Information from Intermediate Representations
Aleatoric and Epistemic Uncertainty
RELATED WORK
DEEP VECCHIA ENSEMBLE
Dataset Extraction from a Neural Network
Conditioning-Set Selection
Ensemble Building from Conditioning Sets
APPLICATION TO PRETRAINED MODELS
UCI Benchmarks
Chemical Property Prediction
...and 38 more sections

Key Result

Proposition 1

Consider a sequence of injective functions $f_1, \ldots, f_L, f_{L+1}$ and a sequence of metric spaces $\{(\mathcal{M}_i, d_i)\}_{i=1}^{L+1}$ such that $f_i: \mathcal{M}_i \rightarrow \mathcal{M}_{i+1}$ for $i = 1, \ldots, L$. Then $\{(\mathcal{M}_i, \tilde{d}_i)\}_{i=2}^{L+1}$ defines a sequence of metric spaces, where ...

Figures (9)

Figure 1: Different layers imply different nearest neighbors. Left: The input (red star) has different nearest-neighbor conditioning sets based on the metrics induced by the layers of the DNN, with the magenta point being in two of the three conditioning sets. The brown, blue, and pink shaded areas denote the regions in input space that will be mapped to a hypersphere in the first, second, and third intermediate spaces, respectively. The conditioning sets derived from the different regions may overlap, as in the blue and brown regions, or be disjoint from the others, as in the pink region. Right: The labeled training data are propagated through the network and intermediate feature maps are stored. For the red test point, we assess uncertainty by considering and weighting instances in the training data that are similar to the test point in one or more of the feature maps.
Figure 2: The DVE pipeline is conceptually simple. The network maps inputs $\bm x_1, \bm x_2, \bm x_3, \ldots, \bm x_n$, to intermediate spaces, where we compute nearest neighbors. For example, in $\mathcal{A}_1$, the neighbors for $\textcolor{red}{x^*}$ are $\bm e_1^1 , \bm e_1^3, \bm e_1^4$ (for simplicity, we define $\bm e_k^i=\bm e_k(\bm x_i)$). For each layer, we define a Vecchia GP, denoted by $\mathcal{V}_1, \ldots, \mathcal{V}_L$, which estimates a distribution for the response $\textcolor{red}{y^*}$. Estimates are combined in a product-of-experts fashion to yield a single distribution parameterized by $\hat{\mu}$ and $\hat{\sigma}$, which improves upon the network's point prediction $\textcolor{red}{\hat{y}^*}$.
Figure 3: Sensitivity and smoothness vary by layer. The 2D t-SNE projection of input points in the original space of the bike UCI dataset (top left panel) and intermediate spaces (remaining five panels), with the color denoting the response value. The blue dot is a fixed test point that has been propagated through the network. The black crosses denote the ordered nearest neighbors of the blue dot in the original input space, and their corresponding intermediate representations. The magenta crosses denote the ordered points nearest to the blue point within each intermediate space.
Figure 4: Ensembling layers can outperform the last layer in terms of MSE. Top left: Data from the S-curve dataset. Top middle: Predictions generated by DVE. Top right: Predictions from the neural network. In all panels, the color denotes the response value. Arrows emphasize that DVE predictions use all three layers, whereas the final network prediction uses only the last layer.
Figure 5: Layerwise conditioning sets relate to aleatoric uncertainty. Top left: A test image of a digit 9 incorrectly classified as a 5 by the neural network. Top right: Nearest-neighbor images for the misclassified 9 at the third (top row) and fourth (bottom row) layers of the network, with the top row showing neighbors of the correct class (9) and the bottom row indicating emergence of the incorrect class (5). Bottom half: For a correctly labeled test image of an 8 (bottom left), the nearest-neighbor images at both the third (top row) and fourth (bottom row) layers (bottom right) consistently show the correct class (8).
...and 4 more figures
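
The product-of-experts combination mentioned in the Figure 2 caption can be sketched for Gaussian experts as follows. This is a minimal sketch, assuming each layer's GP returns a Gaussian with a mean and a variance for the response; the function name `product_of_experts` and the optional `weights` argument are hypothetical, and the weighting scheme actually used by DVE is not specified in this summary.

```python
import numpy as np

def product_of_experts(means, variances, weights=None):
    """Fuse Gaussian expert predictions by multiplying their densities.

    The fused precision is the (optionally weighted) sum of the expert
    precisions, and the fused mean is the precision-weighted average.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    if weights is None:
        weights = np.ones_like(means)        # plain product of experts
    weights = np.asarray(weights, dtype=float)

    precisions = weights / variances
    fused_var = 1.0 / precisions.sum()
    fused_mean = fused_var * (precisions * means).sum()
    return fused_mean, fused_var

# Three hypothetical layerwise predictions for the same test response.
mu_hat, var_hat = product_of_experts([1.0, 1.2, 0.8], [0.5, 0.25, 1.0])
```

Experts with smaller variance pull the fused mean toward their own prediction. With equal unit weights the fused variance shrinks every time an expert is added, which is one reason weighted variants of the product-of-experts are commonly used.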

Theorems & Definitions (1)

Proposition 1
