Not Just How Much, But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions

Mame Diarra Toure, David A. Stephens

TL;DR

Across all tasks, the quality of the posterior approximation shapes uncertainty at least as strongly as the choice of metric, suggesting that how uncertainty is propagated through the network matters as much as how it is measured.

Abstract

In safety-critical classification, the cost of failure is often asymmetric, yet Bayesian deep learning summarises epistemic uncertainty with a single scalar, mutual information (MI), that cannot distinguish whether a model's ignorance involves a benign or safety-critical class. We decompose MI into a per-class vector $C_k(x)=\sigma_k^{2}/(2\mu_k)$, with $\mu_k=\mathbb{E}[p_k]$ and $\sigma_k^2=\mathrm{Var}[p_k]$ across posterior samples. The decomposition follows from a second-order Taylor expansion of the entropy; the $1/\mu_k$ weighting corrects boundary suppression and makes $C_k$ comparable across rare and common classes. By construction $\sum_k C_k \approx \mathrm{MI}$, and a companion skewness diagnostic flags inputs where the approximation degrades. After characterising the axiomatic properties of $C_k$, we validate it on three tasks: (i) selective prediction for diabetic retinopathy, where critical-class $C_k$ reduces selective risk by 34.7% over MI and 56.2% over variance baselines; (ii) out-of-distribution detection on clinical and image benchmarks, where $\sum_k C_k$ achieves the highest AUROC and the per-class view exposes asymmetric shifts invisible to MI; and (iii) a controlled label-noise study in which $\sum_k C_k$ shows less sensitivity to injected aleatoric noise than MI under end-to-end Bayesian training, while both metrics degrade under transfer learning. Across all tasks, the quality of the posterior approximation shapes uncertainty at least as strongly as the choice of metric, suggesting that how uncertainty is propagated through the network matters as much as how it is measured.
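The per-class quantity described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `per_class_epistemic` and `mutual_information`, the `eps` guard, and the use of the population variance (`ddof=0`) are all assumptions made here. MI is computed in its standard form, the entropy of the mean prediction minus the mean entropy of the sampled predictions.

```python
import numpy as np


def per_class_epistemic(probs, eps=1e-12):
    """Return C_k = sigma_k^2 / (2 * mu_k) for one input.

    probs: array of shape (S, K) holding the softmax outputs of
    S posterior samples over K classes (hypothetical layout).
    """
    probs = np.asarray(probs, dtype=float)
    mu = probs.mean(axis=0)   # mu_k = E[p_k]
    var = probs.var(axis=0)   # sigma_k^2 = Var[p_k]; ddof=0 is an assumption
    return var / (2.0 * np.maximum(mu, eps))


def mutual_information(probs, eps=1e-12):
    """Exact MI estimate: H(E[p]) - E[H(p)] over the posterior samples."""
    probs = np.asarray(probs, dtype=float)
    mu = probs.mean(axis=0)
    entropy_of_mean = -np.sum(mu * np.log(mu + eps))
    mean_of_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return entropy_of_mean - mean_of_entropy


# Two posterior samples that disagree mildly on a binary problem.
samples = np.array([[0.6, 0.4],
                    [0.4, 0.6]])
c = per_class_epistemic(samples)
mi = mutual_information(samples)
print("C_k:", c, "sum:", c.sum(), "MI:", mi)
```

In this toy case the two classes contribute equally, and the sum of the contributions sits close to the exact MI, as the second-order approximation suggests it should when the samples are tightly concentrated around their mean.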

Paper Structure

This paper contains 147 sections, 7 theorems, 54 equations, 17 figures, and 24 tables.

Key Result

Lemma 2.1

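The Shannon entropy $H(\boldsymbol{p})=-\sum_k p_k\log p_k$ satisfies $\frac{\partial H}{\partial p_k}=-(\log p_k+1)$ and $\frac{\partial^2 H}{\partial p_k\,\partial p_j}=-\frac{\delta_{kj}}{p_k}$, where $\delta_{kj}$ is the Kronecker delta ($\delta_{kj}=1$ if $k=j$, $0$ otherwise).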
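The link between these entropy derivatives and the per-class vector $C_k$ can be sketched as follows. This is a reconstruction from the quantities defined in the abstract rather than the paper's own proof; it assumes the standard definition of MI as the entropy of the mean prediction minus the expected entropy, and it drops all terms beyond second order.

```latex
\begin{align}
\mathrm{MI}(x) &= H(\boldsymbol{\mu}) - \mathbb{E}\big[H(\boldsymbol{p})\big],
  \qquad \boldsymbol{\mu} = \mathbb{E}[\boldsymbol{p}], \\
H(\boldsymbol{p}) &\approx H(\boldsymbol{\mu})
  - \sum_k (\log \mu_k + 1)\,(p_k - \mu_k)
  - \frac{1}{2} \sum_k \frac{(p_k - \mu_k)^2}{\mu_k}, \\
\mathbb{E}\big[H(\boldsymbol{p})\big] &\approx H(\boldsymbol{\mu})
  - \frac{1}{2} \sum_k \frac{\sigma_k^2}{\mu_k}, \\
\mathrm{MI}(x) &\approx \sum_k \frac{\sigma_k^2}{2\mu_k} = \sum_k C_k(x).
\end{align}
```

The first-order term vanishes in expectation because $\mathbb{E}[p_k - \mu_k] = 0$, and the Hessian is diagonal, so only the per-class variances $\sigma_k^2$ survive. That is why the approximation splits cleanly into one term per class.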

Figures (17)

  • Figure 1: Selective prediction for DR. Left: Critical FNR vs. coverage; $C_{\text{crit\_max}}$ dominates all baselines across the full range. Right: Bootstrap AUSC distribution ($n{=}200$); $C_k$-based policies (red/orange) show non-overlapping interquartile ranges against scalar baselines (grey).
  • Figure 2: Epistemic signatures for Grade 3 errors with similar MI but distinct $C_k$ patterns. Left: Catastrophic miss ($3 \to 0$, MI$=$0.024) concentrates on $C_2$. Centre: Severity underestimate ($3 \to 2$, MI$=$0.027) elevates $C_0$. Right: Grouped comparison showing that the $C_k$ fingerprints differ markedly despite similar MI: the catastrophic miss (red) peaks at $C_2$, while the severity error (dark red) peaks at $C_0$.
  • Figure 3: Distribution of skewness diagnostic $\rho_k$ by class. Safe classes (Grades 0--1) cluster near zero; critical classes (Grades 2--3) exhibit heavier tails, reflecting boundary suppression effects on rare-class posterior samples.
  • Figure 4: $\sum_k C_k$ vs. exact MI for all $7{,}948$ test samples (Pearson $r = 0.988$, Spearman $r = 0.998$). Left: Scatter plot; the near-perfect rank correlation confirms that the second-order approximation preserves the ordering of epistemic uncertainty. Right: Residuals coloured by maximum per-class $\rho_k$; positive residuals concentrate among high-skewness samples, as predicted by Lemma 2.9.
  • Figure 5: Selective risk curves for all 10 deferral policies. Left: Critical FNR across coverage levels; $C_{\text{crit\_max}}$ achieves the lowest AUSC ($0.285$), dominating all baselines across the full coverage range. Right: Error rate; $C_{\text{crit\_max}}$ remains competitive (AUSC $= 0.143$), confirming that per-class targeting does not sacrifice overall accuracy.
  • ...and 12 more figures

Theorems & Definitions (15)

  • Lemma 2.1: Entropy Derivatives
  • Theorem 2.2: MI Approximation
  • Definition 2.3: Per-Class Epistemic Uncertainty
  • Lemma 2.4: Variance Bound on the Simplex
  • Lemma 2.5: Boundary Behaviour of $C_k$
  • Theorem 2.6: Axiomatic Profile
  • Corollary 2.7: A5 Violation as Boundary Correction
  • Remark 2.8: A3 vs. Exact MI
  • Lemma 2.9: Third-Order Correction
  • Definition 2.10: Skewness Diagnostic
  • ...and 5 more