Bag of Coins: A Statistical Probe into Neural Confidence Structures

Agnideep Aich, Sameera Hewage, Md Monzur Murshed, Bruce Wade, Ashit Baran Aich

TL;DR

Bag-of-Coins (BoC) is a non-parametric diagnostic that probes the internal coherence of neural network logits by comparing the softmax confidence $\hat{p}$ with the average pairwise Luce probability $\bar{q}$, yielding a coherence gap $\Delta=\bar{q}-\hat{p}$ and a p-value-based structural score. Grounded in random utility theory, BoC tests whether the reported confidence agrees with the internal logit geometry: it takes $H_0:\bar{q}=\hat{p}$ as the null hypothesis and derives a finite-sample-valid p-value from a binomial-tail bound. Experiments on ViT, ResNet, and RoBERTa show architecture-dependent coherence: ViT exhibits clear ID/OOD separation in $\Delta$ (ID ${\sim}0.1$–$0.2$ vs. OOD ${\sim}0.5$–$0.6$), while ResNet and RoBERTa show substantial overlap, indicating a weaker coherence-based signal. For calibration, BoC helps only when the base model is poorly calibrated, and even then it underperforms standard calibrators; for OOD detection, BoC fails across architectures, with AUROC far below established methods. BoC is therefore best viewed as a research diagnostic of logit-geometry uncertainty rather than a ready-to-deploy calibrator or OOD detector; future work is needed to design geometry-driven OOD scores and to extend the probe to broader modalities and architectures.

Abstract

Modern neural networks often produce miscalibrated confidence scores and struggle to detect out-of-distribution (OOD) inputs, while most existing methods post-process outputs without testing internal consistency. We introduce the Bag-of-Coins (BoC) probe, a non-parametric diagnostic of logit coherence that compares softmax confidence $\hat p$ to an aggregate of pairwise Luce-style dominance probabilities $\bar q$, yielding a deterministic coherence score and a p-value-based structural score. Across ViT, ResNet, and RoBERTa with ID/OOD test sets, the coherence gap $\Delta=\bar q-\hat p$ reveals clear ID/OOD separation for ViT (ID ${\sim}0.1$–$0.2$, OOD ${\sim}0.5$–$0.6$) but substantial overlap for ResNet and RoBERTa (both ${\sim}0$), indicating architecture-dependent uncertainty geometry. As a practical method, BoC improves calibration only when the base model is poorly calibrated (ViT: ECE $0.024$ vs. $0.180$) and underperforms standard calibrators (ECE ${\sim}0.005$), while for OOD detection it fails across architectures (AUROC $0.020$–$0.253$) compared to standard scores ($0.75$–$0.99$). We position BoC as a research diagnostic for interrogating how architectures encode uncertainty in logit geometry rather than a production calibration or OOD detection method.
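
To make the quantities $\hat p$, $\bar q$, and $\Delta$ concrete, here is a minimal Python sketch of the coherence gap for a single logit vector. The exact aggregation is not spelled out in the text above, so the sketch assumes that $\hat p$ is the maximum softmax probability, that each pairwise Luce probability is $e^{z_{\text{top}}}/(e^{z_{\text{top}}}+e^{z_j})$ for the predicted class against rival $j$, and that $\bar q$ is their unweighted mean. The function name `coherence_gap` is hypothetical.

```python
import math


def coherence_gap(logits):
    """Return (p_hat, q_bar, delta) for one logit vector with >= 2 classes.

    Assumed definitions (an illustration, not confirmed by the source):
      p_hat = softmax probability of the top-scoring class,
      q_j   = exp(z_top) / (exp(z_top) + exp(z_j)) for each rival class j,
      q_bar = unweighted mean of q_j over all rivals,
      delta = q_bar - p_hat.
    """
    z = [float(v) for v in logits]
    if len(z) < 2:
        raise ValueError("need at least two classes")
    top = max(range(len(z)), key=z.__getitem__)

    # Softmax confidence, shifted by the top logit for numerical stability.
    p_hat = 1.0 / sum(math.exp(v - z[top]) for v in z)

    # Pairwise Luce probability that the top class beats each rival.
    q = [1.0 / (1.0 + math.exp(v - z[top]))
         for j, v in enumerate(z) if j != top]
    q_bar = sum(q) / len(q)

    return p_hat, q_bar, q_bar - p_hat


if __name__ == "__main__":
    p_hat, q_bar, delta = coherence_gap([2.0, 1.0, 0.0])
    print(f"p_hat={p_hat:.4f}  q_bar={q_bar:.4f}  delta={delta:.4f}")
```

Under these assumed definitions the gap is zero for any two-class problem, since the single pairwise probability coincides with the softmax confidence. It grows as probability mass spreads over many rivals: ten equal logits give $\hat p = 0.1$, $\bar q = 0.5$, and $\Delta = 0.4$.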

Paper Structure

This paper contains 25 sections, 3 theorems, 18 equations, 9 figures, 6 tables, 2 algorithms.

Key Result

Lemma 6.1

Let $X=\sum_{i=1}^k Y_i$ where $\{Y_i\}$ are independent $\mathrm{Bernoulli}(p_i)$ with mean $\mu=\sum_{i=1}^k p_i$. Let $Z\sim \mathrm{Binomial}(k,\mu/k)$. Then for all integers $t$,
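
The two distributions named in Lemma 6.1 can be compared numerically. The sketch below computes the exact distribution of $X$ (a Poisson-binomial sum) by dynamic programming and the distribution of $Z\sim\mathrm{Binomial}(k,\mu/k)$ in closed form, then tabulates both upper tails. Which tail the lemma bounds, and over what range of $t$, is not recoverable from the statement as shown here, so the code only reports the values and asserts no inequality. The helper names and the example probabilities are hypothetical.

```python
from math import comb


def poisson_binomial_pmf(probs):
    """Exact PMF of X = sum of independent Bernoulli(p_i), via convolution."""
    pmf = [1.0]  # distribution of the empty sum
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            nxt[s] += mass * (1.0 - p)   # Y_i = 0
            nxt[s + 1] += mass * p       # Y_i = 1
        pmf = nxt
    return pmf


def binomial_pmf(k, p):
    """PMF of Binomial(k, p) on 0..k."""
    return [comb(k, s) * p**s * (1.0 - p) ** (k - s) for s in range(k + 1)]


def upper_tail(pmf, t):
    """P(S >= t) for an integer threshold t."""
    return sum(pmf[max(t, 0):])


if __name__ == "__main__":
    probs = [0.1, 0.9]           # illustrative p_i, not taken from the paper
    k, mu = len(probs), sum(probs)
    px = poisson_binomial_pmf(probs)
    pz = binomial_pmf(k, mu / k)
    for t in range(k + 1):
        print(t, round(upper_tail(px, t), 4), round(upper_tail(pz, t), 4))
```

For this toy input $\mu = 1$, and at $t = 2$ the sketch gives $P(X \ge 2) = 0.09$ against $P(Z \ge 2) = 0.25$.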

Figures (9)

  • Figure 1: ViT reliability diagram with bootstrap 95% confidence intervals (left) and OOD ROC curves (right) for CIFAR-10 (ID) vs. SVHN (OOD).
  • Figure 2: ViT BoC coherence diagnostics: histogram of $\Delta=\bar{q}-\hat{p}$ showing clear separation between ID (blue, centered around ${\sim}0.1$–$0.2$) and OOD (orange, centered around ${\sim}0.5$–$0.6$) distributions (left), and mean $\Delta$ versus confidence on ID data showing a strong negative relationship (right).
  • Figure 3: ViT BoC sensitivity to trial count $k$: ECE versus $k$ (left) and OOD AUROC versus $k$ (right), comparing deterministic (solid) and Monte Carlo (dashed) variants.
  • Figure 4: ResNet reliability diagram with bootstrap 95% confidence intervals (left) and OOD ROC curves (right) for CIFAR-10 (ID) vs. SVHN (OOD).
  • Figure 5: ResNet BoC coherence diagnostics: histogram of $\Delta=\bar{q}-\hat{p}$ showing ID sharply concentrated near $0$ and OOD having a broad heavy-tailed distribution with overlap near zero (left), and mean $\Delta$ versus confidence on ID data showing a strong decreasing trend (right).
  • ...and 4 more figures

Theorems & Definitions (8)

  • Definition 4.1: Expected Calibration Error (ECE)
  • Definition 5.1: The BoC Probe for Logit Coherence
  • Lemma 6.1: Binomial tail upper-bounds Poisson–binomial tail
  • Proposition 6.2: Finite-sample $p$-value validity
  • Corollary 6.3: Concentration and choice of $k$
  • proof
  • proof
  • proof
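
Definition 4.1 (Expected Calibration Error) underlies the ECE figures quoted in the abstract. The sketch below uses the common equal-width binned estimator: predictions are grouped by confidence, and the absolute gap between accuracy and mean confidence in each bin is averaged with weights proportional to bin size. The number of bins and the bin-edge convention used in the paper are not stated here, so `n_bins=15` and half-open bins are assumptions, and the function name is hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: sum over bins of (n_b / N) * |accuracy_b - mean_conf_b|.

    confidences: predicted-class probabilities in [0, 1].
    correct:     matching 0/1 (or bool) indicators of a correct prediction.
    n_bins:      number of equal-width bins (assumed value, not from the paper).
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins

    for c, h in zip(confidences, correct):
        # Half-open bins [lo, hi); a confidence of exactly 1.0 goes in the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        hit_sum[b] += float(h)
        count[b] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b]:
            gap = abs(hit_sum[b] / count[b] - conf_sum[b] / count[b])
            ece += (count[b] / n) * gap
    return ece


if __name__ == "__main__":
    # Four predictions at 0.9 confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

In the usage example all four predictions fall in one bin with accuracy $0.75$ and mean confidence $0.9$, so the estimate is $0.15$ up to floating-point rounding.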