
Calibration Error Estimation Using Fuzzy Binning

Geetanjali Bihani, Julia Taylor Rayz

TL;DR

This work addresses miscalibration in neural network predictions and the bias that crisp binning introduces into calibration error metrics. It proposes Fuzzy Calibration Error (FCE), which uses trapezoidal fuzzy binning to map prediction probabilities to soft bin memberships, so that predictions near bin edges contribute to the neighboring bins and skew in the confidence distribution distorts the estimate less. Over $M$ bins, the metric is defined as $FCE=\frac{1}{\sum_{m=1}^{M}\mu_{fuzzy}(B_{m})}\sum_{m=1}^{M} |\mu_{fuzzy}(B_{m})| \cdot |acc_{fuzzy}(B_{m})-conf_{fuzzy}(B_{m})|$. The authors compare FCE with ECE empirically on text classification datasets (e.g., AGNews, 20NG, IMDb) and across varying bin counts, and report that FCE yields tighter, more robust calibration estimates, particularly in multi-class settings. The work contributes a new calibration metric, shows its practical advantages, and provides code to support broader use and validation in high-stakes decision systems.

Abstract

Neural network-based decisions tend to be overconfident, where their raw outcome probabilities do not align with the true decision probabilities. Calibration of neural networks is an essential step towards more reliable deep learning frameworks. Prior metrics of calibration error primarily utilize crisp bin membership-based measures. This exacerbates skew in model probabilities and portrays an incomplete picture of calibration error. In this work, we propose a Fuzzy Calibration Error metric (FCE) that utilizes a fuzzy binning approach to calculate calibration error. This approach alleviates the impact of probability skew and provides a tighter estimate while measuring calibration error. We compare our metric with ECE across different data populations and class memberships. Our results show that FCE offers better calibration error estimation, especially in multi-class settings, alleviating the effects of skew in model confidence scores on calibration error estimation. We make our code and supplementary materials available at: https://github.com/bihani-g/fce
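To make the definition of FCE concrete, the following is a minimal Python sketch that computes ECE with crisp bins and FCE with trapezoidal fuzzy bins, reading both bin weights in the formula as the total fuzzy membership $\mu_{fuzzy}(B_{m})$. It is not the authors' implementation (that is in the repository linked above) and the details are assumptions: the function names, the `overlap` parameter that sets how far each trapezoid ramps across a bin edge, and the open shoulders on the first and last bins are all hypothetical choices made for illustration. Here `conf` is taken to be the top predicted probability and `correct` flags whether that prediction was right, as is usual for ECE.

```python
import numpy as np


def trapezoid_memberships(conf, n_bins=15, overlap=0.25):
    """Soft membership of each confidence in each bin, shape (N, n_bins).

    Assumption: `overlap` (0 < overlap <= 0.5) is the fraction of a bin
    width over which membership ramps linearly on each side of a bin edge.
    """
    if not 0.0 < overlap <= 0.5:
        raise ValueError("overlap must be in (0, 0.5]")
    conf = np.asarray(conf, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    lo, hi = edges[:-1], edges[1:]          # crisp bin edges
    d = overlap / n_bins                    # half-width of each ramp
    p = conf[:, None]                       # broadcast to (N, n_bins)
    rise = np.clip((p - (lo - d)) / (2 * d), 0.0, 1.0)
    fall = np.clip(((hi + d) - p) / (2 * d), 0.0, 1.0)
    rise[:, 0] = 1.0                        # first bin: no ramp below 0
    fall[:, -1] = 1.0                       # last bin: no ramp above 1
    return np.minimum(rise, fall)


def ece(conf, correct, n_bins=15):
    """Expected Calibration Error with equal-width crisp bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Left-closed bins; a confidence of exactly 1.0 goes to the last bin.
    idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for m in range(n_bins):
        mask = idx == m
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            total += mask.sum() * gap
    return total / len(conf)


def fce(conf, correct, n_bins=15, overlap=0.25):
    """Fuzzy Calibration Error: membership-weighted |accuracy - confidence|."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    mu = trapezoid_memberships(conf, n_bins, overlap)
    mass = mu.sum(axis=0)                   # mu_fuzzy(B_m) for each bin
    keep = mass > 0                         # skip bins with no membership
    acc = (mu * correct[:, None]).sum(axis=0)[keep] / mass[keep]
    cnf = (mu * conf[:, None]).sum(axis=0)[keep] / mass[keep]
    return float((mass[keep] * np.abs(acc - cnf)).sum() / mass.sum())


if __name__ == "__main__":
    # Toy data: confidences skewed toward 1.0, as with an overconfident model.
    rng = np.random.default_rng(0)
    conf = rng.beta(8, 1, size=1000)
    correct = (rng.random(1000) < conf ** 2).astype(float)
    print("ECE:", ece(conf, correct))
    print("FCE:", fce(conf, correct))
```

With these particular trapezoids, the memberships of any confidence sum to one across bins, so the normalizer $\sum_{m}\mu_{fuzzy}(B_{m})$ equals the sample count and FCE coincides with ECE whenever no confidence falls inside a ramp. That property belongs to this sketch; other membership shapes need not share it.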

Paper Structure (9 sections, 6 equations, 4 figures, 1 table)


Figures (4)

  • Figure 1: Crisp binning (top left) and fuzzy binning (bottom left) of prediction probabilities, where the number of bins $M=3$. An example of the difference in bin assignment based on $\hat{p}_{i}$ in crisp vs fuzzy binning (right).
  • Figure 2: Variation in calibration error estimated using ECE and FCE across different bin sizes (top to bottom) and class distributions (left vs right).
  • Figure 3: Variation in model overconfidence (OF) across different sample sizes.
  • Figure 4: Binning of prediction probabilities across $M=15$ bins (model fine-tuned on $n=5000$ samples).