Combining Priors with Experience: Confidence Calibration Based on Binomial Process Modeling

Jinzong Dong, Zhaohui Jiang, Dong Pan, Haoyang Yu

TL;DR

The paper tackles confidence calibration by explicitly incorporating a principled prior behind the calibration curve through a binomial-process model. It proposes a Beta-prior-based calibration map $g(\hat{S};\alpha,\beta,c)$, fitted via a convex-equivalent maximum-likelihood objective, and proves that the estimator is Lipschitz continuous and sample-efficient: it needs only $3/B$ of the samples required by histogram binning with $B$ bins. A new calibration metric, $TCE_{bpm}$, is defined and proven to be a consistent calibration measure. The authors additionally introduce a binomial-process-based data-simulation method that generates realistic calibration datasets for benchmarking calibration metrics against the true calibration error. Empirically, the estimated calibration curves align closely with the true calibration curve on simulated data, and the method outperforms competing metrics on real datasets, highlighting practical benefits for safety-critical and underrepresented-population scenarios.
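To make the maximum-likelihood step concrete, here is a minimal sketch. It rests on a loud assumption: this summary does not state the exact form of $g(\hat{S};\alpha,\beta,c)$, so the code uses the three-parameter beta-calibration family, $\mathrm{logit}\, g = a\ln s - b\ln(1-s) + c$, as a stand-in. Under that assumed form the Bernoulli negative log-likelihood is a logistic-regression loss in $(a,b,c)$ and hence convex. The function names, the synthetic data generator, and the plain gradient-descent optimizer are all hypothetical choices for illustration, not the authors' algorithm.

```python
import math
import random

EPS = 1e-6  # keeps log() finite at the ends of [0, 1]

def calibration_map(s, a, b, c):
    """ASSUMED stand-in form: logit g = a*ln(s) - b*ln(1 - s) + c."""
    s = min(max(s, EPS), 1.0 - EPS)
    z = a * math.log(s) - b * math.log(1.0 - s) + c
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_likelihood(data, params):
    """Mean Bernoulli negative log-likelihood of (score, label) pairs."""
    a, b, c = params
    total = 0.0
    for s, y in data:
        p = min(max(calibration_map(s, a, b, c), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

def fit(data, steps=300, lr=0.5):
    """Plain gradient descent, starting from the identity map g(s) = s."""
    a, b, c = 1.0, 1.0, 0.0
    n = len(data)
    # Precompute the two log features once per sample.
    feats = []
    for s, y in data:
        s = min(max(s, EPS), 1.0 - EPS)
        feats.append((math.log(s), -math.log(1.0 - s), y))
    for _ in range(steps):
        ga = gb = gc = 0.0
        for f1, f2, y in feats:
            z = a * f1 + b * f2 + c
            r = 1.0 / (1.0 + math.exp(-z)) - y  # d(loss)/d(logit)
            ga += r * f1
            gb += r * f2
            gc += r
        a -= lr * ga / n
        b -= lr * gb / n
        c -= lr * gc / n
    return a, b, c

def make_data(n, rng, true_params=(0.6, 0.6, -0.3)):
    """Hypothetical synthetic data: scores uniform on (0.05, 0.95)."""
    data = []
    for _ in range(n):
        s = rng.uniform(0.05, 0.95)
        y = 1 if rng.random() < calibration_map(s, *true_params) else 0
        data.append((s, y))
    return data

data = make_data(2000, random.Random(0))
params = fit(data)
print("fitted (a, b, c):", params)
print("NLL identity:", neg_log_likelihood(data, (1.0, 1.0, 0.0)))
print("NLL fitted:  ", neg_log_likelihood(data, params))
```

Starting from the identity map makes the comparison easy to read: any drop in the negative log-likelihood reflects miscalibration that the fitted curve has absorbed.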

Abstract

Confidence calibration of classification models is a technique for estimating the true posterior probability of the predicted class, which is critical for reliable decision-making in practical applications. Existing confidence calibration methods mostly use statistical techniques to estimate the calibration curve from data or fit a user-defined calibration function, but they often fail to exploit the prior distribution behind the calibration curve. A well-informed prior distribution, however, can provide valuable information beyond the empirical data when data are limited or in low-density regions of confidence scores. To fill this gap, this paper proposes a new method that integrates the prior distribution behind the calibration curve with empirical data to estimate a continuous calibration curve. This is realized by modeling the sampling process of calibration data as a binomial process and maximizing the likelihood function of that process. We prove that the calibration curve estimation method is Lipschitz continuous with respect to the data distribution and requires only $3/B$ of the sample size required by histogram binning, where $B$ is the number of bins. We also design a new calibration metric ($TCE_{bpm}$), which uses the estimated calibration curve to estimate the true calibration error (TCE). $TCE_{bpm}$ is proven to be a consistent calibration measure. Furthermore, realistic calibration datasets can be generated by binomial process modeling from a preset true calibration curve and confidence score distribution; these datasets can serve as a benchmark to measure and compare the discrepancy between existing calibration metrics and the true calibration error. The effectiveness of our calibration method and metric is verified on real-world and simulated data.
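The data-simulation idea can be illustrated with a small sketch: draw confidence scores from a preset distribution, draw each correctness label as a Bernoulli trial whose success probability is the preset true calibration curve, and compare a binned estimate with the known true calibration error. Everything specific here is an assumption made for illustration: the curve `true_curve(s) = s ** 1.5`, the Beta(5, 1.5) score distribution, the function names, and the use of equal-width binned ECE as the comparison metric. The paper's own simulation settings are not given in this summary.

```python
import random

def true_curve(s):
    """Hypothetical preset true calibration curve (an overconfident model)."""
    return s ** 1.5

def simulate(n, rng):
    """Draw (score, label) pairs: label ~ Bernoulli(true_curve(score))."""
    data = []
    for _ in range(n):
        s = rng.betavariate(5.0, 1.5)  # hypothetical confidence distribution
        y = 1 if rng.random() < true_curve(s) else 0
        data.append((s, y))
    return data

def true_calibration_error(data):
    """Monte Carlo estimate of E|s - true_curve(s)|; labels are not used."""
    return sum(abs(s - true_curve(s)) for s, _ in data) / len(data)

def binned_ece(data, num_bins=15):
    """Equal-width binned ECE: sum over bins of (n_b / n) * |conf_b - acc_b|."""
    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    for s, y in data:
        b = min(int(s * num_bins), num_bins - 1)
        conf_sum[b] += s
        hit_sum[b] += y
    # (n_b / n) * |conf_sum/n_b - hit_sum/n_b| simplifies to |conf_sum - hit_sum| / n
    return sum(abs(c - h) for c, h in zip(conf_sum, hit_sum)) / len(data)

rng = random.Random(0)
for n in (500, 5000):
    sample = simulate(n, rng)
    print(n, round(true_calibration_error(sample), 4), round(binned_ece(sample), 4))
```

Because the true curve is known by construction, the gap between any metric and the true calibration error can be measured directly, which is what makes simulated data usable as a benchmark.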

Paper Structure

This paper contains 46 sections, 6 theorems, 54 equations, 5 figures, 5 tables, 3 algorithms.

Key Result

Theorem 1

For two distributions $D_{1}, D_{2}$ over $[0,1] \times \{0,1\}$, let $\Gamma$ be the family of all couplings of $D_{1}$ and $D_{2}$, and let $g(\hat{S};\theta_{D})$ denote the calibration curve learned from $D$ via Eq. equivalence. Then, $\forall \gamma \in \Gamma$, it holds that … where $L \ge 0$. Therefore:

Figures (5)

  • Figure 1: Experimental results of our method. HB denotes histogram binning (Zadrozny and Elkan, 2001). In (a), the estimated calibration curve on real data aligns well with histogram binning results from various binning schemes and closely matches their mean. In (b), the calibration curve estimated by our method closely approximates the true calibration curve on simulated data. In (c), our calibration metric is the closest to the true calibration error (TCE) in many cases (e.g., when the number of samples is 1500, 2000, 2500, 3000, 3500, and 5000).
  • Figure 2: Visualization of the selected true calibration curve.
  • Figure 3: Visualization of the estimated calibration curve on the public logit dataset.
  • Figure 4: Visualization of estimated results of calibration curves.
  • Figure 5: Calibration metrics comparison in all confidence scores.

Theorems & Definitions (17)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Theorem 2
  • Definition 3
  • Definition 4
  • Theorem 3
  • Corollary 1
  • Theorem 4
  • proof
  • ...and 7 more