(Implicit) Ensembles of Ensembles: Epistemic Uncertainty Collapse in Large Models

Andreas Kirsch

TL;DR

The paper investigates a paradox in uncertainty quantification: as models and ensembles grow, epistemic uncertainty can collapse, undermining reliable uncertainty estimates. It develops a theoretical framework around ensembles of ensembles and the implicit ensembling hypothesis, with connections to Neural Tangent Kernel theory, and validates the phenomenon on toy tasks, MNIST, CIFAR-10, and large vision models, including ResNets and Vision Transformers. A key result shows that as the sub-ensemble size $M$ increases, $\text{MI}(Y; \boldsymbol{E}_I \mid \boldsymbol{x})$ tends to zero, indicating vanishing disagreement between sub-ensembles, while implicit ensemble extraction can recover much of the lost uncertainty from a single large model. The work suggests that naive scaling does not guarantee improved uncertainty estimates and offers practical techniques for recovering epistemic uncertainty, with significant implications for safety-critical applications and out-of-distribution detection.
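The collapse described above can be illustrated with a small simulation. This is a minimal sketch under strong assumptions, not the paper's experimental setup: member predictions are softmaxes of i.i.d. Gaussian logits (an illustrative stand-in for trained networks), and the function names `sub_ensemble_mi` and `mutual_information` are hypothetical. Each sub-ensemble's prediction is the average of its $M$ members, and the mutual information between the label and the sub-ensemble index is computed as the entropy of the mean prediction minus the mean entropy of the sub-ensemble predictions.

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=axis)

def mutual_information(probs):
    """MI between the label and the member index.

    probs has shape (members, classes); the result is
    H[mean prediction] - mean H[individual predictions].
    """
    return entropy(probs.mean(axis=0)) - entropy(probs).mean()

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sub_ensemble_mi(num_sub_ensembles, sub_size, num_classes=10, seed=0):
    """Disagreement between sub-ensembles for a single (imaginary) input.

    Assumption: members are i.i.d. random softmax predictors, so this only
    illustrates the averaging effect, not any trained model.
    """
    rng = np.random.default_rng(seed)
    logits = rng.normal(scale=2.0, size=(num_sub_ensembles, sub_size, num_classes))
    member_probs = softmax(logits)
    # Each sub-ensemble predicts with the average of its members.
    sub_probs = member_probs.mean(axis=1)
    return mutual_information(sub_probs)

# Ten sub-ensembles, as in the toy regression figure; only the size M varies.
for m in (1, 4, 16, 64, 256):
    print(f"M={m:4d}  MI between sub-ensembles = {sub_ensemble_mi(10, m):.4f}")
```

In this toy setting the printed mutual information shrinks as $M$ grows, because the averaged predictions of large sub-ensembles concentrate around the same mean and therefore stop disagreeing.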

Abstract

Epistemic uncertainty is crucial for safety-critical applications and data acquisition tasks. Yet, we find an important phenomenon in deep learning models: an epistemic uncertainty collapse as model complexity increases, challenging the assumption that larger models invariably offer better uncertainty quantification. We introduce implicit ensembling as a possible explanation for this phenomenon. To investigate this hypothesis, we provide theoretical analysis and experiments that demonstrate uncertainty collapse in explicit ensembles of ensembles and show experimental evidence of similar collapse in wider models across various architectures, from simple MLPs to state-of-the-art vision models including ResNets and Vision Transformers. We further develop implicit ensemble extraction techniques to decompose larger models into diverse sub-models, showing we can thus recover epistemic uncertainty. We explore the implications of these findings for uncertainty estimation.

Paper Structure (30 sections, 17 equations, 7 figures, 6 tables)

This paper contains 30 sections, 17 equations, 7 figures, 6 tables.

Introduction
Theoretical Framework
Deep Ensembles
Epistemic Uncertainty Collapse for Ensembles of Ensembles
Setup and Predictive Distribution
Limit of Increasing Sub-Ensemble Size
Chain Rule Decomposition
Implicit Ensembling
Connection to Neural Tangent Kernel Theory
Empirical Results
Implicit Ensemble Extraction
Related Work
Conclusion
Theoretical Framework
Variance-Based Epistemic Uncertainty
...and 15 more sections

Figures (7)

Figure 1: Epistemic Uncertainty Collapse in a Toy Regression Problem. As the sub-ensemble size increases, epistemic uncertainty vanishes. Ensembles of 10 sub-ensembles with different sub-ensemble sizes. Left: True function, data, and ensemble predictions. Middle: Epistemic uncertainty across input space. Right: Mean epistemic uncertainty vs. sub-ensemble size.
Figure 2: Ensemble of Ensemble Results for CIFAR-10 (iD) vs. SVHN (OoD). Different configurations of 24 ResNet-50 models trained on CIFAR-10. (a) As the sub-ensemble size increases, the epistemic uncertainty on SVHN as the OoD dataset collapses. (b) The area under the receiver-operating characteristic (AUROC $\uparrow$) for OoD detection using mutual information slowly deteriorates as the sub-ensemble size increases.
Figure 3: Epistemic Uncertainty Collapse on MNIST via Implicit Ensembling. (a) Mutual Information Empirical Cumulative Distribution Function (ECDF) for Different MLP Widths. As MLP size increases, mutual information decreases while accuracy remains stable. This trend persists across training and other distributions. (b) MNIST vs. Fashion-MNIST OoD Detection AUROC Curves. The mean difference in uncertainty scores between in-distribution and out-of-distribution samples (in parentheses) also decreases with width, further evidencing epistemic uncertainty collapse, while the AUROC for OoD detection slightly improves across both uncertainty metrics.
Figure 4: Recovering Epistemic Uncertainty through Implicit Ensemble Extraction. (a) The extracted implicit ensemble (dashed line) largely recovers the mutual information scores of a fully trained ensemble of the same width, supporting the hypothesis of latent ensemble structures in large neural networks. The 10-member implicit ensemble is extracted from a single MLP with width factor 64 (\ref{sec:extracted_implicit_ensemble}). The regular 10-member ensembles comprise MLPs with width factors 4 and 64 trained on MNIST. Ensembles are evaluated on MNIST, Dirty-MNIST, and Fashion-MNIST test sets. (b) The extracted implicit ensemble shows comparable AUROC scores across all metrics relative to a fully trained deep ensemble of the same width. OoD detection is performed using mutual information or entropy scores. The final panel compares the softmax entropy of the original wide MLP with the predictive entropy of its extracted implicit ensemble. The mean entropy difference between iD and OoD samples is larger for the extracted ensemble. At the same time, the OoD performance does not match the single wider MLP.
Figure 5: Classification with Rejection for Implicit Ensemble Extractions from Pre-Trained Models. Each subfigure shows three performance metrics (Accuracy, Negative Log-Likelihood, and Calibration Error) as a function of epistemic uncertainty quantiles for different ensemble sizes. Solid lines represent extracted ensembles of increasing size (from 2 to 7/16), while the dashed black line represents the original single model. (a) The mutual information between predictions is used as the epistemic uncertainty measure for ensembles, while entropy is used for the single model. As the ensemble size increases, we observe improved area under the curve (AUC), which indicates better epistemic uncertainty calibration (with the notable exception of the calibration error). This demonstrates that extracting larger ensembles from a single pre-trained model can enhance performance and uncertainty quantification. (b) Temperature scaling improves epistemic uncertainty calibration in general but benefits the original model most. Accuracy and NLL for extracted epistemic uncertainty only benefit in the low-uncertainty regime. (c) For ViT models, we find that mutual information weighted by the logit sum of each ensemble performs better than plain mutual information ((c) vs (d) with mutual information).
...and 2 more figures
