Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks

Bálint Mucsányi, Michael Kirchhof, Seong Joon Oh

TL;DR

This work examines whether aleatoric and epistemic uncertainty can be disentangled in practice, through the first large-scale benchmark of 19 uncertainty quantification (UQ) methods across 13 uncertainty tasks on ImageNet-1k and CIFAR-10. It tests two decomposition formulas (information-theoretic and Bregman) and a broad suite of distributional and deterministic estimators, using multiple aggregators and five seeds. The key finding is that none of the examined approaches disentangles the sources of uncertainty in practice: the aleatoric and epistemic estimates are highly correlated, and performance varies widely across tasks, so there is no one-size-fits-all solution. The study gives practical guidance on which specialized estimator to use for which task, highlights opportunities for task-centric disentangled uncertainties, and emphasizes the need for broader ground-truth data for aleatoric uncertainty. All code, logs, and benchmarks are publicly available to support reproducibility and further research.

Abstract

Uncertainty quantification, once a singular task, has evolved into a spectrum of tasks, including abstained prediction, out-of-distribution detection, and aleatoric uncertainty quantification. The latest goal is disentanglement: the construction of multiple estimators that are each tailored to one and only one source of uncertainty. This paper presents the first benchmark of uncertainty disentanglement. We reimplement and evaluate a comprehensive range of uncertainty estimators, from Bayesian over evidential to deterministic ones, across a diverse range of uncertainty tasks on ImageNet. We find that, despite recent theoretical endeavors, no existing approach provides pairs of disentangled uncertainty estimators in practice. We further find that specialized uncertainty tasks are harder than predictive uncertainty tasks, where we observe saturating performance. Our results provide both practical advice for which uncertainty estimators to use for which specific task, and reveal opportunities for future research toward task-centric and disentangled uncertainties. All our reimplementations and Weights & Biases logs are available at https://github.com/bmucsanyi/untangle.
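
To make the idea of a decomposition formula concrete, the following is a minimal sketch of the standard information-theoretic split for a classifier ensemble: total uncertainty (entropy of the averaged prediction) equals an aleatoric term (average entropy of the members) plus an epistemic term (their mutual-information gap). The paper's exact equation is not reproduced in this excerpt, so the formula, the function names `entropy` and `it_decomposition`, and the toy data are assumptions for illustration only.

```python
import numpy as np


def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats along `axis` (probabilities clipped for log)."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p), axis=axis)


def it_decomposition(probs):
    """Split predictive uncertainty into aleatoric and epistemic parts.

    probs: array of shape (M, N, C) holding class probabilities from
           M ensemble members (or posterior samples) for N inputs, C classes.
    Returns (total, aleatoric, epistemic), each of shape (N,).
    """
    mean_probs = probs.mean(axis=0)
    total = entropy(mean_probs)               # entropy of the averaged prediction
    aleatoric = entropy(probs).mean(axis=0)   # average entropy of the members
    epistemic = total - aleatoric             # mutual information (non-negative)
    return total, aleatoric, epistemic


# Hypothetical toy data: 5 members, 4 inputs, 10 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 4, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

total, aleatoric, epistemic = it_decomposition(probs)
print("aleatoric:", aleatoric)
print("epistemic:", epistemic)
```

Because both components are computed from the same set of member predictions, nothing in the formula itself forces them to vary independently, which is the property the benchmark tests empirically.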

Paper Structure (107 sections, 35 equations, 51 figures, 6 tables)

This paper contains 107 sections, 35 equations, 51 figures, 6 tables.

Figures (51)

  • Figure 1: Decomposition formulas such as the information-theoretical decomposition split second-order distributions into separate estimates of epistemic and aleatoric uncertainty. However, we find that these estimates are highly correlated with each other. The density plot on the right shows this for the epistemic and aleatoric uncertainty estimates obtained by decomposing deep ensemble uncertainties on ImageNet-1k. In practice, they therefore capture the same notion of uncertainty rather than two disentangled ones.
  • Figure 2: Rank correlation between the aleatoric and epistemic estimates obtained by the IT decomposition on ImageNet (left) and CIFAR-10 (right). The two uncertainty components are strongly correlated for most methods, violating a necessary condition of their disentanglement.
  • Figure 3: Performance of uncertainty quantification methods on epistemic (left) and aleatoric (right) uncertainty tasks on the ImageNet validation dataset.
  • Figure 4: ID predictive uncertainty evaluation on the ImageNet validation dataset. The Mahalanobis method is a specialized OOD detector that cannot differentiate between ID samples.
  • Figure 5: Expected calibration error on ImageNet.
  • ...and 46 more figures
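
The rank-correlation check described in the Figure 2 caption can be sketched as follows. The excerpt does not say which rank correlation the paper uses, so Spearman's coefficient is assumed here as one common choice; the helper names `average_ranks` and `spearman` and the synthetic per-sample estimates are hypothetical and do not come from the benchmark.

```python
import numpy as np


def average_ranks(x):
    """Return 1-based ranks of x, giving tied values their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    # Replace the ranks inside each group of equal values by the group mean.
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


# Synthetic per-sample estimates (assumption: not real benchmark outputs).
rng = np.random.default_rng(0)
aleatoric = rng.gamma(shape=2.0, scale=1.0, size=1000)
epistemic = 0.3 * aleatoric + rng.normal(scale=0.1, size=1000)

rho = spearman(aleatoric, epistemic)
print(f"rank correlation between components: {rho:.3f}")
```

A coefficient near zero would be consistent with two estimators that rank samples independently, while a value near one means both estimators order the samples almost identically, which is the pattern the caption reports for most methods.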