On the Calibration of Epistemic Uncertainty: Principles, Paradoxes and Conflictual Loss

Mohammed Fellaji, Frédéric Pennerath, Brieuc Conan-Guez, Miguel Couceiro

TL;DR

Epistemic uncertainty estimates from common deep learning approaches are hard to calibrate objectively and behave paradoxically as the training set or the model size changes. The authors formalize two guiding principles for epistemic uncertainty, one data-related and one model-related, and argue that the observed violations stem from practical approximations of the posterior. They introduce Conflictual Deep Ensembles (Conflictual DE) with a class-specific bias regularizer, which restores both principles without sacrificing accuracy or calibration. Across MNIST, SVHN, and CIFAR10, Conflictual DE improves uncertainty reliability, OOD detection, and misclassification discrimination, offering a practical path toward more trustworthy epistemic uncertainty in real-world deployments.

Abstract

The calibration of predictive distributions has been widely studied in deep learning, but the same cannot be said about the more specific epistemic uncertainty as produced by Deep Ensembles, Bayesian Deep Networks, or Evidential Deep Networks. Although measurable, this form of uncertainty is difficult to calibrate on an objective basis, as it depends on the prior, for which a variety of choices exist. Nevertheless, epistemic uncertainty must in all cases satisfy two formal requirements: first, it must decrease when the training dataset gets larger and, second, it must increase when the model expressiveness grows. Despite these expectations, our experimental study shows that on several reference datasets and models, measures of epistemic uncertainty violate these requirements, sometimes presenting trends completely opposite to those expected. These paradoxes between expectation and reality raise the question of the true utility of epistemic uncertainty as estimated by these models. A formal argument suggests that this disagreement is due to a poor approximation of the posterior distribution rather than to a flaw in the measure itself. Based on this observation, we propose a regularization function for deep ensembles, called conflictual loss, in line with the above requirements. We emphasize its strengths by showing experimentally that it restores both requirements of epistemic uncertainty, without sacrificing either the performance or the calibration of the deep ensembles.

Paper Structure

This paper contains 19 sections, 1 theorem, 13 equations, 13 figures, 6 tables.

Key Result

Theorem

The mutual information metric satisfies the first principle in expectation with respect to new random iid samples $\mathcal{D}_2$, i.e.,

Figures (13)

  • Figure 1: Heatmaps of epistemic uncertainty (mutual information) on the MNIST, SVHN, and CIFAR10 datasets for MC-Dropout, label smoothing combined with MC-Dropout (MC-Dropout LS), EDL, Deep Ensembles (DE), and Conflictual DE. For each heatmap, the x-axis gives the sizes of the hidden layers and the y-axis gives the number of training samples. Both axes use logarithmic scales. Color scales differ between heatmaps. Epistemic uncertainty should decrease along the y-axis (data-related principle) and increase along the x-axis (model-related principle).
  • Figure 2: Heatmaps of AUROC for OOD detection based on epistemic uncertainty. Same representation as Fig. 1 but with the same color scale per dataset.
  • Figure 3: Heatmaps of the SCE. Same representation as Fig. 2. Lower is better.
  • Figure 4: Heatmaps of the accuracy. Color scales are the same per dataset.
  • Figure 5: Heatmaps of the Brier score. Color scales are the same per dataset.
  • ...and 8 more figures
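
Figure 1 measures epistemic uncertainty as mutual information. For an ensemble, one common way to estimate it is the entropy of the averaged prediction minus the average entropy of the members' predictions. The following is a minimal NumPy sketch of that standard estimator, not the authors' code; the function names and the `(members, samples, classes)` array layout are assumptions made for illustration.

```python
import numpy as np


def entropy(probs, axis=-1, eps=1e-12):
    """Shannon entropy in nats along `axis`; probabilities are clipped to avoid log(0)."""
    probs = np.clip(probs, eps, 1.0)
    return -np.sum(probs * np.log(probs), axis=axis)


def mutual_information(member_probs):
    """Estimate epistemic uncertainty from ensemble predictions.

    member_probs: array of shape (M, N, K) holding the softmax outputs of
    M ensemble members for N inputs over K classes (assumed layout).
    Returns an array of shape (N,).
    """
    member_probs = np.asarray(member_probs, dtype=float)
    mean_probs = member_probs.mean(axis=0)               # averaged prediction, (N, K)
    total = entropy(mean_probs)                          # entropy of the average, (N,)
    expected = entropy(member_probs).mean(axis=0)        # average member entropy, (N,)
    return total - expected


# Two members, two inputs, two classes.
# Input 0: both members agree. Input 1: members are confident but disagree.
probs = np.array([
    [[0.7, 0.3], [1.0, 0.0]],
    [[0.7, 0.3], [0.0, 1.0]],
])
mi = mutual_information(probs)
```

In this toy example the first input gets a mutual information of zero because the members agree, while the second gets a value close to log 2, the maximum for two classes, because the members are confident and contradict each other.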

Theorems & Definitions (5)

  • Definition: First principle
  • Theorem
  • Proof
  • Definition: Submodel
  • Definition: Second principle