Activation-Space Uncertainty Quantification for Pretrained Networks

Richard Bergna; Stefan Depeweg; Sergio Calvo-Ordoñez; Jonathan Plenk; Alvaro Cartea; Jose Miguel Hernández-Lobato

TL;DR

Gaussian Process Activations (GAPA) is a post-hoc method that shifts Bayesian modeling from weights to activations, enabling deterministic single-pass uncertainty propagation without sampling, backpropagation, or second-order information.

Abstract

Reliable uncertainty estimates are crucial for deploying pretrained models; yet, many strong methods for quantifying uncertainty require retraining, Monte Carlo sampling, or expensive second-order computations and may alter a frozen backbone's predictions. To address this, we introduce Gaussian Process Activations (GAPA), a post-hoc method that shifts Bayesian modeling from weights to activations. GAPA replaces standard nonlinearities with Gaussian-process activations whose posterior mean exactly matches the original activation, preserving the backbone's point predictions by construction while providing closed-form epistemic variances in activation space. To scale to modern architectures, we use a sparse variational inducing-point approximation over cached training activations, combined with local k-nearest-neighbor subset conditioning, enabling deterministic single-pass uncertainty propagation without sampling, backpropagation, or second-order information. Across regression, classification, image segmentation, and language modeling, GAPA matches or outperforms strong post-hoc baselines in calibration and out-of-distribution detection while remaining efficient at test time.
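
The mean-preservation property described above can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation: it assumes a GP whose prior mean is the original activation (here $\tanh$), an RBF kernel with unit lengthscale, and a handful of hypothetical cached pre-activations. Under those assumptions the observed residuals are zero, so the posterior mean equals the activation exactly, while the posterior variance grows away from the cached points. The names `gp_activation`, `rbf`, `solve`, and `cached` are illustrative only.

```python
import math

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel on scalars (an assumed kernel choice)."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_activation(z_star, cached, phi=math.tanh, noise=1e-6):
    """Return (posterior mean, posterior variance) at z_star.

    Assumption: the GP prior mean is phi and the targets are phi(cached),
    so the residuals fed to the GP are identically zero.
    """
    n = len(cached)
    K = [[rbf(cached[i], cached[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(z_star, z) for z in cached]
    residual = [phi(z) - phi(z) for z in cached]   # targets minus prior mean
    alpha = solve(K, residual)                      # all zeros
    mean = phi(z_star) + sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = max(rbf(z_star, z_star) - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mean, var

# Hypothetical cached pre-activations collected from training data.
cached = [-2.0, -1.0, 0.0, 1.0, 2.0]
for z in [0.3, 1.0, 6.0]:
    mean, var = gp_activation(z, cached)
    print(f"z={z:4.1f}  mean={mean:.4f}  tanh={math.tanh(z):.4f}  var={var:.6f}")
```

In this sketch the point prediction is untouched by construction, and only the variance carries new information: it is close to zero at a cached input and approaches the prior variance far from the cached set.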

Paper Structure (97 sections, 1 theorem, 70 equations, 25 figures, 13 tables)

Introduction
Model Proposition
Method pipeline.
Uncertainty Modeling Perspective
Weight-space uncertainty.
Activation-space uncertainty (GAPA).
Gaussian Process Activation Function
Setup.
GP activation.
Data collation for the GP.
Mean preservation.
Posterior covariance.
Why diagonal covariance?
Local Inducing-Point Approximation
Stage 1: Inducing-point construction (offline).
...and 82 more sections

Key Result

Lemma 1

Consider a Gaussian process prior with fixed kernel hyperparameters and Gaussian observation noise. Let $\tilde{Z}$ denote a set of inducing inputs and let $\tilde{Z}_k \subset \tilde{Z}$ be any subset. Then, for any test input $z^*$, the posterior variance satisfies $\sigma^2(z^* \mid \tilde{Z}_k) \ge \sigma^2(z^* \mid \tilde{Z})$. That is, conditioning on a subset of inducing inputs cannot reduce posterior variance relative to conditioning on the full inducing set.
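
The lemma can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the paper's code: it uses scalar inputs, an RBF kernel with unit lengthscale, a noise variance of $10^{-2}$, and a hypothetical set of ten inducing inputs, and it picks the subset as the $k=3$ nearest inducing inputs to each test point. The helper names `posterior_var`, `k_nearest`, and `results` are made up for this example.

```python
import math

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel on scalars (an assumed kernel choice)."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def posterior_var(z_star, inducing, noise=1e-2):
    """GP posterior variance at z_star given noisy observations at `inducing`."""
    n = len(inducing)
    K = [[rbf(inducing[i], inducing[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(z_star, z) for z in inducing]
    v = solve(K, k_star)
    return rbf(z_star, z_star) - sum(k * w for k, w in zip(k_star, v))

def k_nearest(z_star, inducing, k):
    """Return the k inducing inputs closest to z_star."""
    return sorted(inducing, key=lambda z: abs(z - z_star))[:k]

# Hypothetical inducing inputs and test points.
inducing = [-3.0, -2.0, -1.2, -0.5, 0.0, 0.4, 1.1, 1.9, 2.6, 3.3]
results = []
for z_star in [-4.0, -0.8, 0.2, 1.5, 5.0]:
    full = posterior_var(z_star, inducing)
    local = posterior_var(z_star, k_nearest(z_star, inducing, k=3))
    results.append((z_star, full, local))
    print(f"z*={z_star:5.1f}  var(full)={full:.6f}  var(kNN subset)={local:.6f}")
```

In every row the subset variance should be at least as large as the full-set variance, up to floating-point error, which is the sense in which local conditioning is conservative.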

Figures (25)

Figure 1: Comparison of uncertainty quantification methods on a toy binary classification task. Left to right: MAP (deterministic backbone), MC Dropout, Last-Layer Laplace, and GAPA (ours). Background shading indicates predictive confidence (darker = more confident); orange/yellow points show the two classes. Key observation: GAPA preserves the backbone's decision boundary (black line) exactly while adding epistemic uncertainty that grows smoothly away from training data.
Figure 2: GAPA overview. Top: GAPA leaves the network’s point predictions unchanged (mean-preserving activations) while propagating an additional epistemic variance signal to the output. Bottom left: deterministic $\tanh$ activation; orange points denote cached training activations. Bottom right: GAPA-$\tanh$, whose posterior mean matches $\tanh$ exactly; the shaded region shows $\pm 2$ standard deviations.
Figure 3: Predictive NLL under rotation corruption for MNIST (left panel) and FMNIST (right panel); lower is better. Results are averaged over 5 random seeds.
Figure 4: OOD detection vs inference cost on CIFAR-10. OOD AUROC is plotted against test-time inference cost (log scale) for ResNet backbones. Dashed lines indicate Pareto frontiers (higher OOD, lower cost). GAPA consistently lies on the frontier, achieving strong OOD performance at substantially lower inference cost than baselines.
Figure 5: Left: Effect of number of inducing points $N_{\text{inducing}}$ and $k$ (for nearest neighbor inducing points) on OOD detection task with GAPA at layer [27]. Right: effect of layer placement of GAPA at $N_{\text{inducing}}=10^5$. In both experiments results are averaged over $5$ runs with $512$ sequences each. In both panels we also show the $\ell/T_{\mathrm{opt}}$ bound (green) as an upper threshold of what can be achieved by global logits scaling.
...and 20 more figures

Theorems & Definitions (2)

Lemma 1: Conservative uncertainty under subset conditioning
proof
