Calibrating Expressions of Certainty

Peiqi Wang, Barbara D. Lam, Yingcheng Liu, Ameneh Asgari-Targhi, Rameswar Panda, William M. Wells, Tina Kapur, Polina Golland

TL;DR

This paper reframes certainty in predictions as distributions over the probability simplex, enabling a generalized calibration framework beyond scalar confidence scores. It extends the expected calibration error (ECE) to distributional outputs, derives robust estimators, and uses discrete optimal transport (OT) to construct interpretable calibration maps between certainty phrases. The authors validate the approach on radiologists and language models, showing that OT-based post-hoc calibration improves ECE and Brier scores while yielding actionable guidance (e.g., substitute phrases) for humans. The work provides a practical, distribution-level method for improving the reliability of natural language expressions of certainty in both medical and AI systems, with broad implications for decision-making and trust.

Abstract

We present a novel approach to calibrating linguistic expressions of certainty, e.g., "Maybe" and "Likely". Unlike prior work that assigns a single score to each certainty phrase, we model uncertainty as distributions over the simplex to capture their semantics more accurately. To accommodate this new representation of certainty, we generalize existing measures of miscalibration and introduce a novel post-hoc calibration method. Leveraging these tools, we analyze the calibration of both humans (e.g., radiologists) and computational models (e.g., language models) and provide interpretable suggestions to improve their calibration.
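
As a minimal illustration of the distributional view, the sketch below represents two certainty phrases as Beta distributions on [0, 1], which is the binary case of the simplex. The parameter values in `PHRASES` and the helper names are hypothetical, chosen only for illustration; they are not fitted values from the paper.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at a point x strictly inside (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def beta_mean(a, b):
    """Mean of Beta(a, b); this is what a single-score model would keep."""
    return a / (a + b)


def beta_var(a, b):
    """Variance of Beta(a, b); this is what a single-score model discards."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


# Hypothetical (a, b) parameters per phrase, for illustration only.
PHRASES = {
    "Maybe": (2.0, 2.0),
    "Likely": (6.0, 2.0),
}

for phrase, (a, b) in PHRASES.items():
    std = math.sqrt(beta_var(a, b))
    print(f"{phrase}: mean={beta_mean(a, b):.2f}, std={std:.2f}")
```

A scalar model would retain only the mean of each phrase, whereas the distribution also records how widely the interpretation of a phrase can vary.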

Paper Structure (61 sections, 1 theorem, 19 equations, 15 figures, 5 tables)

Key Result

Proposition 1

The estimators $\hat{r}_m$, $\hat{p}_m$, and $\hat{g}_m$ defined in Equation eq:ece_per_bin_estimators_for_gX_dist are consistent estimators for $\mathbb{E}\left[ Y\mid S\in I_m \right]$, $P(S\in I_m)$, and $\mathbb{E}\left[ S\mid S\in I_m \right]$ respectively.
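
To make the three per-bin quantities concrete, here is a minimal sketch of plug-in estimates for scalar scores, combined into the usual binned ECE, $\sum_m \hat{p}_m\,|\hat{r}_m - \hat{g}_m|$. The referenced equation is not reproduced in this section, so this is the standard scalar-score version and not necessarily the paper's distributional estimator; the function names and the equal-width binning are assumptions.

```python
def per_bin_estimates(scores, labels, num_bins=10):
    """Plug-in estimates per equal-width bin I_m of [0, 1].

    r_hat[m] ~ E[Y | S in I_m], p_hat[m] ~ P(S in I_m), g_hat[m] ~ E[S | S in I_m].
    Empty bins get r_hat = g_hat = 0.0; they carry p_hat = 0 and so do not
    contribute to the ECE.
    """
    n = len(scores)
    counts = [0] * num_bins
    label_sums = [0.0] * num_bins
    score_sums = [0.0] * num_bins
    for s, y in zip(scores, labels):
        m = min(int(s * num_bins), num_bins - 1)  # a score of 1.0 goes in the last bin
        counts[m] += 1
        label_sums[m] += y
        score_sums[m] += s
    r_hat = [label_sums[m] / counts[m] if counts[m] else 0.0 for m in range(num_bins)]
    p_hat = [counts[m] / n for m in range(num_bins)]
    g_hat = [score_sums[m] / counts[m] if counts[m] else 0.0 for m in range(num_bins)]
    return r_hat, p_hat, g_hat


def binned_ece(scores, labels, num_bins=10):
    """Binned ECE: sum over bins of p_hat * |r_hat - g_hat|."""
    r_hat, p_hat, g_hat = per_bin_estimates(scores, labels, num_bins)
    return sum(p * abs(r - g) for r, p, g in zip(r_hat, p_hat, g_hat))


# Toy data, for illustration only: confidence scores and binary outcomes.
scores = [0.9, 0.9, 0.9, 0.9, 0.3, 0.3]
labels = [1, 1, 1, 0, 0, 1]
print(binned_ece(scores, labels, num_bins=5))
```

In the toy data, the high-confidence bin is overconfident (accuracy 0.75 at confidence 0.9) and the low-confidence bin is underconfident (accuracy 0.5 at confidence 0.3); both gaps enter the ECE weighted by how often each bin is used.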

Figures (15)

  • Figure 1: Probability density functions obtained by fitting beta distributions to results of a survey of radiologists' perception of different certainty phrases [shinagare2023diagnostic].
  • Figure 2: Reliability diagrams of radiologists' certainty phrase use in clinical reports, stratified by pathology (top) and radiologist identity (bottom). The calibration curve (red), its 95% confidence interval (blue), and the score density (gray) are shown. Calibration varies significantly across pathologies and radiologists. Regions where the calibration curve lies above the identity line correspond to underconfidence, i.e., radiologists understating their certainty; interestingly, these coincide with regions of low confidence. Similarly, overconfidence appears in regions of high confidence.
  • Figure 3: Examples of calibrating radiologists on two representative pathologies: atelectasis and edema. The 1st and 4th columns show the reliability diagrams before and after the post-hoc calibration, respectively. The 2nd column displays the cost matrix $C$ of the optimal transport problem, while the 3rd column illustrates the probabilistic calibration map $T$. For atelectasis, underconfidence can be addressed by suggesting the use of "May" instead of "Present"; for edema, overconfidence can be mitigated by recommending that radiologists replace "Present" and "Likely" with "May". Quantitatively, the calibration approach improves both the ECE and the Brier Score (BS).
  • Figure 4: Reliability diagrams of LMs verbalizing confidence from a fixed set of certainty phrases generated by prompting gpt-4o, evaluated on SciQ (top) and TruthfulQA (bottom). The calibration curve (red), with its 95% confidence interval (blue), and score density (gray) are shown. Models are better calibrated on SciQ than TruthfulQA, with larger models (e.g., gpt-4o) outperforming their smaller variants (e.g., gpt-4o-mini). The smooth calibration curve improves the ability of human viewers to compare the calibration performance of different models.
  • Figure 5: Relative frequency of diagnostic certainty phrases used in X-ray reports from the curated paired (X-ray, CT) dataset.
  • ...and 10 more figures
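
The cost matrix $C$ and calibration map $T$ described in the Figure 3 caption can be illustrated with a small discrete transport problem between phrases. The sketch below swaps in entropic-regularized OT (Sinkhorn iterations), which is not necessarily the solver used in the paper. The phrase scores, both marginals, the cost $|\mu_i - \mu_j|$, and the function names are all hypothetical, chosen only to show the mechanics.

```python
import math


def sinkhorn(a, b, cost, eps=0.1, iters=2000):
    """Entropic-regularized OT plan between discrete marginals a and b."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def row_normalize(plan):
    """Turn a joint plan into P(replacement phrase j | original phrase i)."""
    return [[x / sum(row) for x in row] for row in plan]


# All values below are hypothetical, for illustration only.
phrases = ["May", "Likely", "Present"]
mean_score = [0.40, 0.70, 0.95]   # assumed scalar summary of each phrase
observed_use = [0.2, 0.3, 0.5]    # assumed current phrase frequencies
target_use = [0.4, 0.3, 0.3]      # assumed frequencies after calibration

cost = [[abs(mi - mj) for mj in mean_score] for mi in mean_score]
plan = sinkhorn(observed_use, target_use, cost)
calibration_map = row_normalize(plan)

for phrase, row in zip(phrases, calibration_map):
    print(phrase, [round(x, 3) for x in row])
```

Each row of `calibration_map` reads as a substitution suggestion: given the phrase originally used, how often it should be kept or replaced by each alternative. Smaller `eps` values move the plan closer to unregularized OT but make the iterations converge more slowly.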

Theorems & Definitions (3)

  • Proposition 1
  • proof
  • proof