Not Just How Much, But Where: Decomposing Epistemic Uncertainty into Per-Class Contributions

Mame Diarra Toure; David A. Stephens

TL;DR

Across all tasks, the quality of the posterior approximation shapes uncertainty at least as strongly as the choice of metric, suggesting that how uncertainty is propagated through the network matters as much as how it is measured.

Abstract

In safety-critical classification, the cost of failure is often asymmetric, yet Bayesian deep learning summarises epistemic uncertainty with a single scalar, mutual information (MI), that cannot distinguish whether a model's ignorance involves a benign or safety-critical class. We decompose MI into a per-class vector $C_k(x)=\sigma_k^{2}/(2\mu_k)$, with $\mu_k=\mathbb{E}[p_k]$ and $\sigma_k^2=\mathrm{Var}[p_k]$ across posterior samples. The decomposition follows from a second-order Taylor expansion of the entropy; the $1/\mu_k$ weighting corrects boundary suppression and makes $C_k$ comparable across rare and common classes. By construction $\sum_k C_k \approx \mathrm{MI}$, and a companion skewness diagnostic flags inputs where the approximation degrades. After characterising the axiomatic properties of $C_k$, we validate it on three tasks: (i) selective prediction for diabetic retinopathy, where critical-class $C_k$ reduces selective risk by 34.7\% over MI and 56.2\% over variance baselines; (ii) out-of-distribution detection on clinical and image benchmarks, where $\sum_k C_k$ achieves the highest AUROC and the per-class view exposes asymmetric shifts invisible to MI; and (iii) a controlled label-noise study in which $\sum_k C_k$ shows less sensitivity to injected aleatoric noise than MI under end-to-end Bayesian training, while both metrics degrade under transfer learning. Across all tasks, the quality of the posterior approximation shapes uncertainty at least as strongly as the choice of metric, suggesting that how uncertainty is propagated through the network matters as much as how it is measured.
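
The per-class quantity defined in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `per_class_epistemic` and `mutual_information` are hypothetical, and Dirichlet draws stand in for posterior predictive samples (which in practice might come from MC dropout or an ensemble).

```python
import numpy as np

def per_class_epistemic(probs, eps=1e-12):
    """C_k = sigma_k^2 / (2 * mu_k) from posterior samples.

    probs: array of shape (S, K), S sampled probability vectors over K classes.
    """
    mu = probs.mean(axis=0)   # mu_k = E[p_k] across samples
    var = probs.var(axis=0)   # sigma_k^2 = Var[p_k] across samples
    return var / (2.0 * np.maximum(mu, eps))

def mutual_information(probs, eps=1e-12):
    """Exact sample-based MI: entropy of the mean minus mean of the entropies."""
    mu = probs.mean(axis=0)
    entropy_of_mean = -np.sum(mu * np.log(mu + eps))
    mean_of_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return entropy_of_mean - mean_of_entropy

# Assumption: synthetic Dirichlet samples play the role of posterior samples.
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[20.0, 10.0, 5.0, 1.0], size=2000)

c = per_class_epistemic(probs)
mi = mutual_information(probs)
print("C_k:", c)
print("sum_k C_k:", c.sum(), "exact MI:", mi)
```

On samples like these, the sum of the per-class terms should land close to the exact MI, while the vector `c` shows how that total is distributed over classes.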

Paper Structure (147 sections, 7 theorems, 54 equations, 17 figures, 24 tables)

Introduction
Contributions.
Per-Class Epistemic Uncertainty
Setup and Notation
From Scalar MI to a Per-Class Decomposition
Per-Class Epistemic Uncertainty Vector
Why Variance Alone Fails: Boundary Suppression
Axiomatic Analysis
Reliability Diagnostic via Skewness
Off-diagonal structure and the CBEC metric.
Selective Prediction for Diabetic Retinopathy
Experimental Setup
Data.
Model.
Inference and evaluation.
...and 132 more sections

Key Result

Lemma 2.1

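The Shannon entropy $H(\boldsymbol{p})=-\sum_k p_k\log p_k$ satisfies $\frac{\partial H}{\partial p_k}=-(1+\log p_k)$ and $\frac{\partial^2 H}{\partial p_k\,\partial p_j}=-\frac{\delta_{kj}}{p_k}$, where $\delta_{kj}$ is the Kronecker delta ($\delta_{kj}=1$ if $k=j$, $0$ otherwise).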
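
As a sketch of how these derivatives lead to the per-class decomposition stated in the abstract, the following expands the entropy to second order around the posterior mean $\boldsymbol{\mu}=\mathbb{E}[\boldsymbol{p}]$, using the standard entropy derivatives $\partial H/\partial p_k=-(1+\log p_k)$ and $\partial^2 H/\partial p_k\,\partial p_j=-\delta_{kj}/p_k$. This is a reconstruction from the abstract's definitions, not the paper's own proof; third-order and higher terms are dropped.

```latex
\begin{align*}
H(\boldsymbol{p})
  &\approx H(\boldsymbol{\mu})
   - \sum_k (1+\log\mu_k)\,(p_k-\mu_k)
   - \frac{1}{2}\sum_k \frac{(p_k-\mu_k)^2}{\mu_k}, \\
\mathbb{E}\big[H(\boldsymbol{p})\big]
  &\approx H(\boldsymbol{\mu}) - \sum_k \frac{\sigma_k^{2}}{2\mu_k}
  \qquad\text{since } \mathbb{E}[p_k-\mu_k]=0,\ \mathbb{E}[(p_k-\mu_k)^2]=\sigma_k^{2}, \\
\mathrm{MI}
  &= H(\boldsymbol{\mu}) - \mathbb{E}\big[H(\boldsymbol{p})\big]
   \approx \sum_k \frac{\sigma_k^{2}}{2\mu_k}
   = \sum_k C_k .
\end{align*}
```

The Hessian is diagonal, so the second-order term carries no cross-class products; this is why the approximation splits into one term per class.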

Figures (17)

Figure 1: Selective prediction for DR. Left: Critical FNR vs. coverage; $C_{\text{crit\_max}}$ dominates all baselines across the full range. Right: Bootstrap AUSC distribution ($n{=}200$); $C_k$-based policies (red/orange) show non-overlapping interquartile ranges against scalar baselines (grey).
Figure 2: Epistemic signatures for Grade 3 errors with similar MI but distinct $C_k$ patterns. Left: Catastrophic miss ($3 \to 0$, MI$=$0.024) concentrates on $C_2$. Centre: Severity underestimate ($3 \to 2$, MI$=$0.027) elevates $C_0$. Right: Grouped comparison showing that the $C_k$ fingerprints differ markedly despite similar MI: the catastrophic miss (red) peaks at $C_2$, while the severity error (dark red) peaks at $C_0$.
Figure 3: Distribution of skewness diagnostic $\rho_k$ by class. Safe classes (Grades 0--1) cluster near zero; critical classes (Grades 2--3) exhibit heavier tails, reflecting boundary suppression effects on rare-class posterior samples.
Figure 4: $\sum_k C_k$ vs. exact MI for all $7{,}948$ test samples (Pearson $r = 0.988$, Spearman $r = 0.998$). Left: Scatter plot; the near-perfect rank correlation confirms that the second-order approximation preserves the ordering of epistemic uncertainty. Right: Residuals coloured by maximum per-class $\rho_k$; positive residuals concentrate among high-skewness samples, as predicted by Lemma 2.9 (Third-Order Correction).
Figure 5: Selective risk curves for all 10 deferral policies. Left: Critical FNR across coverage levels; $C_{\text{crit\_max}}$ achieves the lowest AUSC ($0.285$), dominating all baselines across the full coverage range. Right: Error rate; $C_{\text{crit\_max}}$ remains competitive (AUSC $= 0.143$), confirming that per-class targeting does not sacrifice overall accuracy.
...and 12 more figures
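
The selective-prediction curves in Figures 1 and 5 plot a risk against coverage as the most uncertain inputs are deferred. A generic version of that computation might look like the sketch below. It is written under assumptions: `selective_risk_curve` is a hypothetical helper, the risk here is a plain error rate rather than the paper's critical FNR (which would need restricting to critical-class cases), and the area summary is a simple mean over coverage levels that may differ from the paper's AUSC definition.

```python
import numpy as np

def selective_risk_curve(scores, errors):
    """Risk among retained samples as coverage grows.

    scores: uncertainty per sample (higher means defer earlier).
    errors: 1 if the prediction is wrong, 0 otherwise.
    Returns (coverage, risk), both of length N.
    """
    scores = np.asarray(scores, dtype=float)
    errors = np.asarray(errors, dtype=float)
    order = np.argsort(scores, kind="stable")   # most certain samples first
    kept = np.arange(1, len(scores) + 1)        # number of samples retained
    coverage = kept / len(scores)
    risk = np.cumsum(errors[order]) / kept      # error rate among retained
    return coverage, risk

# Toy example (hypothetical data): the two uncertain samples are the two errors.
scores = [0.1, 0.9, 0.2, 0.8]
errors = [0, 1, 0, 1]
coverage, risk = selective_risk_curve(scores, errors)
area = risk.mean()  # crude area-under-curve summary over coverage levels
print(coverage, risk, area)
```

Any per-sample score can be passed as `scores`, for example MI, $\sum_k C_k$, or the maximum $C_k$ over a chosen set of critical classes, which allows deferral policies to be compared on the same curve.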

Theorems & Definitions (15)

Lemma 2.1: Entropy Derivatives
Theorem 2.2: MI Approximation
Definition 2.3: Per-Class Epistemic Uncertainty
Lemma 2.4: Variance Bound on the Simplex
Lemma 2.5: Boundary Behaviour of $C_k$
Theorem 2.6: Axiomatic Profile
Corollary 2.7: A5 Violation as Boundary Correction
Remark 2.8: A3 vs. Exact MI
Lemma 2.9: Third-Order Correction
Definition 2.10: Skewness Diagnostic
...and 5 more
