Table of Contents
Fetching ...

Geometric Calibration and Neutral Zones for Uncertainty-Aware Multi-Class Classification

Soumojit Das, Nairanjana Dasgupta, Prashanta Dutta

TL;DR

<3-5 sentence high-level summary>

Abstract

Modern artificial intelligence systems make critical decisions yet often fail silently when uncertain -- even well-calibrated models provide no mechanism to identify \textit{which specific predictions} are unreliable. We develop a geometric framework addressing both calibration and instance-level uncertainty quantification for neural network probability outputs. Treating probability vectors as points on the $(c-1)$-dimensional probability simplex equipped with the Fisher--Rao metric, we construct: (i) Additive Log-Ratio (ALR) calibration maps that reduce exactly to Platt scaling for binary problems while extending naturally to multi-class settings, and (ii) geometric reliability scores that translate calibrated probabilities into actionable uncertainty measures, enabling principled deferral of ambiguous predictions to human review. Theoretical contributions include: consistency of the calibration estimator at rate $O_p(n^{-1/2})$ via M-estimation theory (Theorem~1), and tight concentration bounds for reliability scores with explicit sub-Gaussian parameters enabling sample size calculations for validation set design (Theorem~2). We conjecture Neyman--Pearson optimality of our neutral zone construction based on connections to Bhattacharyya coefficients. Empirical validation on Adeno-Associated Virus classification demonstrates that the two-stage framework captures 72.5\% of errors while deferring 34.5\% of samples, reducing automated decision error rates from 16.8\% to 6.9\%. Notably, calibration alone yields marginal accuracy gains; the operational benefit arises primarily from the reliability scoring mechanism, which applies to any well-calibrated probability output. This work bridges information geometry and statistical learning, offering formal guarantees for uncertainty-aware classification in applications requiring rigorous validation.

Geometric Calibration and Neutral Zones for Uncertainty-Aware Multi-Class Classification

TL;DR

<3-5 sentence high-level summary>

Abstract

Modern artificial intelligence systems make critical decisions yet often fail silently when uncertain -- even well-calibrated models provide no mechanism to identify \textit{which specific predictions} are unreliable. We develop a geometric framework addressing both calibration and instance-level uncertainty quantification for neural network probability outputs. Treating probability vectors as points on the -dimensional probability simplex equipped with the Fisher--Rao metric, we construct: (i) Additive Log-Ratio (ALR) calibration maps that reduce exactly to Platt scaling for binary problems while extending naturally to multi-class settings, and (ii) geometric reliability scores that translate calibrated probabilities into actionable uncertainty measures, enabling principled deferral of ambiguous predictions to human review. Theoretical contributions include: consistency of the calibration estimator at rate via M-estimation theory (Theorem~1), and tight concentration bounds for reliability scores with explicit sub-Gaussian parameters enabling sample size calculations for validation set design (Theorem~2). We conjecture Neyman--Pearson optimality of our neutral zone construction based on connections to Bhattacharyya coefficients. Empirical validation on Adeno-Associated Virus classification demonstrates that the two-stage framework captures 72.5\% of errors while deferring 34.5\% of samples, reducing automated decision error rates from 16.8\% to 6.9\%. Notably, calibration alone yields marginal accuracy gains; the operational benefit arises primarily from the reliability scoring mechanism, which applies to any well-calibrated probability output. This work bridges information geometry and statistical learning, offering formal guarantees for uncertainty-aware classification in applications requiring rigorous validation.

Paper Structure

This paper contains 71 sections, 6 theorems, 58 equations, 6 figures, 5 tables.

Key Result

Proposition 1

For $c = 2$, geometric calibration reduces to Platt scaling. Specifically, if $\mathop{\mathrm{ALR}}\nolimits(p) = \log\frac{p}{1-p} = \mathop{\mathrm{logit}}\nolimits(p)$, then the calibration map becomes: where $\sigma$ is the logistic function and $(a, b) \in \mathbb{R}^2$ are learned parameters.

Figures (6)

  • Figure 1: Finite-sample convergence of geometric calibration.(Left) Calibration discrepancy $\|\hat{T}_n - \hat{T}_{1241}\|$ versus sample size $n \in \{100, 250, 500, 750, 1000\}$ with 1,000 bootstrap iterations per size. Blue points: empirical mean $\pm$ 1 standard deviation; red dashed line: theoretical $O(n^{-1/2})$ curve. Error decreases faster than theoretical prediction. (Right) Log-log plot reveals linear relationship with slope $\alpha = -0.82$ (95% CI: $[-1.11, -0.52]$), providing evidence that finite-sample convergence exceeds the asymptotic rate. The confidence interval excludes $\alpha = -0.5$, though the lower bound $-0.52$ is close to this threshold.
  • Figure 2: Confusion matrices for AAV cargo classification ($n_{\text{val}} = 310$). (Left) Uncalibrated CNN: 83.2% accuracy, 52 errors. (Center) After geometric calibration: 83.5% accuracy, 51 errors. (Right) Automated decisions outside neutral zone ($n=203$): 93.1% accuracy, 14 errors. Calibration alone yields marginal accuracy improvement; the substantial gain (6.9% error rate vs. 16.8% baseline) comes from uncertainty-aware deferral via neutral zones.
  • Figure 3: Per-class reliability diagrams ($n_{\text{val}} = 310$, adaptive binning). Each panel plots mean predicted probability (x-axis) against empirical frequency (y-axis); perfect calibration follows the dashed diagonal. LOESS curves track the diagonal more closely after calibration, particularly for minority class ssDNA.
  • Figure 4: Reliability score distributions ($\lambda = 1.0$). Correct predictions (blue): mean $= 0.579$. Errors (red): mean $= 0.371$. Separation $\Delta = 0.208$ (effect size $d = 1.17$). Dashed line: threshold $\tau^* = 0.445$.
  • Figure 5: Error detection via reliability scores.(Left) ROC curve with AUC $= 0.801$. Operating point at $\tau^* = 0.445$: 72.5% true positive rate (error capture) at 27.0% false positive rate. (Right) Precision-recall curve. At operating point: 34.6% precision, 72.5% recall. Horizontal line: baseline 16.5% error rate. The $2.1\times$ improvement over random deferral demonstrates effective error identification.
  • ...and 1 more figures

Theorems & Definitions (43)

  • Definition 1: Fisher-Rao Metric
  • Remark 1: Geometric Foundations
  • Definition 2: Additive Log-Ratio Transform
  • Remark 2: Coordinate Choice
  • Remark 3: Model Specification
  • Remark 4: Regularization and Prior Specification
  • Definition 3: Reliability Score
  • Remark 5: Decision-Theoretic Interpretation
  • Definition 4: Neutral Zone
  • Remark 6: Information-Theoretic Grounding
  • ...and 33 more