Bag of Coins: A Statistical Probe into Neural Confidence Structures
Agnideep Aich, Sameera Hewage, Md Monzur Murshed, Bruce Wade, Ashit Baran Aich
TL;DR
Bag-of-Coins (BoC) is a non-parametric diagnostic that probes the internal coherence of neural network logits. It compares the softmax confidence $\hat{p}$ with the average pairwise Luce probability $\bar{q}$, yielding a coherence gap $\Delta=\bar{q}-\hat{p}$ and a p-value-based structural score. Grounded in random utility theory, BoC tests whether the reported confidence agrees with the internal logit geometry: it takes $H_0:\bar{q}=\hat{p}$ as the null hypothesis and derives a finite-sample-valid p-value via a binomial-tail bound. Experiments on ViT, ResNet, and RoBERTa show that coherence is architecture-dependent: ViT exhibits clear ID/OOD separation in $\Delta$ (ID $\sim$0.1–0.2 vs. OOD $\sim$0.5–0.6), while ResNet and RoBERTa show substantial overlap, indicating weaker coherence-based signals. For calibration, BoC helps only when the base model is poorly calibrated, and it underperforms standard calibrators; for OOD detection, it fails across architectures, with AUROC far below established methods. BoC is therefore best viewed as a research diagnostic of how uncertainty is encoded in logit geometry rather than a ready-to-deploy calibrator or OOD detector. Future work is needed to design geometry-driven OOD scores and to extend the probe to broader modalities and architectures.
Abstract
Modern neural networks often produce miscalibrated confidence scores and struggle to detect out-of-distribution (OOD) inputs, while most existing methods post-process outputs without testing internal consistency. We introduce the Bag-of-Coins (BoC) probe, a non-parametric diagnostic of logit coherence that compares softmax confidence $\hat p$ to an aggregate of pairwise Luce-style dominance probabilities $\bar q$, yielding a deterministic coherence score and a p-value-based structural score. Across ViT, ResNet, and RoBERTa with ID/OOD test sets, the coherence gap $\Delta=\bar q-\hat p$ reveals clear ID/OOD separation for ViT (ID ${\sim}0.1$–$0.2$, OOD ${\sim}0.5$–$0.6$) but substantial overlap for ResNet and RoBERTa (both ${\sim}0$), indicating architecture-dependent uncertainty geometry. As a practical method, BoC improves calibration only when the base model is poorly calibrated (ViT: ECE $0.024$ vs. $0.180$) and underperforms standard calibrators (ECE ${\sim}0.005$), while for OOD detection it fails across architectures (AUROC $0.020$–$0.253$) compared to standard scores ($0.75$–$0.99$). We position BoC as a research diagnostic for interrogating how architectures encode uncertainty in logit geometry rather than a production calibration or OOD detection method.
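To make the coherence gap concrete, the following is a minimal sketch of how $\hat p$, $\bar q$, and $\Delta$ could be computed from a single logit vector. It rests on assumptions that the abstract does not spell out: that the pairwise Luce probability of the predicted class $i$ against class $j$ is $e^{z_i}/(e^{z_i}+e^{z_j})$, and that $\bar q$ is the plain average of these over all $j \neq i$. The function names `softmax` and `coherence_gap` are illustrative only, and the p-value-based structural score is not covered here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def coherence_gap(logits):
    """Return (p_hat, q_bar, delta) for one logit vector.

    Assumption (not confirmed by the abstract): q_bar is the mean, over all
    rival classes j, of the pairwise Luce probability that the predicted
    class beats j, i.e. sigmoid(z_top - z_j).
    """
    if len(logits) < 2:
        raise ValueError("need at least two classes")
    top = max(range(len(logits)), key=lambda k: logits[k])
    p_hat = softmax(logits)[top]
    # z_top - z_j >= 0 for every rival j, so exp(-(z_top - z_j)) cannot overflow.
    pairwise = [
        1.0 / (1.0 + math.exp(-(logits[top] - z)))
        for j, z in enumerate(logits)
        if j != top
    ]
    q_bar = sum(pairwise) / len(pairwise)
    return p_hat, q_bar, q_bar - p_hat
```

Under these assumptions the gap is never negative, since each pairwise probability is at least as large as the softmax confidence. For example, `coherence_gap([0.0] * 10)` gives $\hat p = 0.1$, $\bar q = 0.5$, and $\Delta \approx 0.4$, while any two-class logit vector gives $\Delta = 0$ up to floating-point error.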
