
Calibrating Verbalized Confidence with Self-Generated Distractors

Victor Wang, Elias Stengel-Eskin

TL;DR

This work addresses miscalibration and saturation in verbalized LLM confidence by introducing Distractor-Normalized Coherence (DiNCo). DiNCo normalizes a main claim's confidence by the total confidence across self-generated distractors and uses NLI to downweight redundancy, then blends this validator signal with a generator-coherence signal via self-consistency. It demonstrates improved calibration (ECE, Brier score, AUC) and reduced saturation across short-form QA and long-form biography generation, outperforming strong baselines in zero-resource settings. The approach has practical implications for trustworthy AI, enabling more reliable, interpretable, and decision-ready model outputs without task-specific tuning.

Abstract

Calibrated confidence estimates are necessary for large language model (LLM) outputs to be trusted by human users. While LLMs can express their confidence in human-interpretable ways, verbalized LLM-generated confidence scores have empirically been found to be miscalibrated, reporting high confidence on instances with low accuracy and thereby harming trust and safety. We hypothesize that this overconfidence often stems from a given LLM's heightened suggestibility when faced with claims that it encodes little information about; we empirically validate this hypothesis, finding more suggestibility on lower-accuracy claims. Building on this finding, we introduce Distractor-Normalized Coherence (DINCO), which estimates and accounts for an LLM's suggestibility bias by having the model verbalize its confidence independently across several self-generated distractors (i.e. alternative claims), and normalizes by the total verbalized confidence. To further improve calibration, we leverage generator-validator disagreement, augmenting normalized validator confidence with a consistency-based estimate of generator confidence. Here, we frame the popular approach of self-consistency as leveraging coherence across sampled generations, and normalized verbalized confidence as leveraging coherence across validations on incompatible claims, allowing us to integrate these complementary dimensions of coherence into DINCO. Moreover, our analysis shows that DINCO provides less saturated -- and therefore more usable -- confidence estimates, and that further sampling alone cannot close the gap between DINCO and baselines, with DINCO at 10 inference calls outperforming self-consistency at 100.
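
As a rough illustration of the normalization described in the abstract, the sketch below divides a main claim's verbalized confidence by the total confidence over the claim and its distractors, then combines the result with a self-consistency estimate. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the choice to include the main claim in the denominator, the optional per-distractor weights (standing in for NLI-based downweighting of redundant distractors), exact-match self-consistency, and the plain average used to blend the two signals are all hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def normalized_confidence(
    main_conf: float,
    distractor_confs: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Divide the main claim's confidence by the total confidence mass.

    Assumption: the total includes the main claim itself plus each
    distractor's confidence scaled by a weight in [0, 1] (for example,
    a weight that downweights redundant distractors).
    """
    if weights is None:
        weights = [1.0] * len(distractor_confs)
    if len(weights) != len(distractor_confs):
        raise ValueError("weights and distractor_confs must have equal length")
    total = main_conf + sum(w * c for w, c in zip(weights, distractor_confs))
    if total == 0.0:
        return 0.0  # no confidence mass anywhere
    return main_conf / total


def self_consistency(main_answer: str, samples: Sequence[str]) -> float:
    """Fraction of sampled generations that exactly match the main answer."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return Counter(samples)[main_answer] / len(samples)


def blended_confidence(validator: float, generator: float) -> float:
    """Hypothetical blend: a plain average of the two signals."""
    return 0.5 * (validator + generator)


if __name__ == "__main__":
    # A suggestible model: confidence 1.0 on the main claim, but also
    # high confidence on mutually incompatible distractors.
    validator = normalized_confidence(1.0, [1.0, 0.5, 0.5])
    generator = self_consistency("Paris", ["Paris", "Lyon", "Paris", "Paris"])
    print(validator, generator, blended_confidence(validator, generator))
```

In this toy case the raw verbalized confidence is saturated at 1.0, while the normalized value drops because the model also assigns substantial confidence to incompatible alternatives.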

Paper Structure

This paper contains 41 sections, 9 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Calibration metrics (Expected Calibration Error $\downarrow$, Brier score $\downarrow$, area under the ROC curve $\uparrow$; see the appendix on evaluation metrics) with Qwen3-8B on TriviaQA using ${\rm P(True)}$ as verbalized confidence. (Left) Verbalized confidence is saturated at high confidence. For each bar, we label the number of instances whose confidence falls in the interval and we darken larger bins. (Center) DiNCo normalizes by the total confidence over candidate answers, relieving saturation and improving calibration. (Right) Since verbalized confidence is saturated at high confidence, it is unable to achieve an acceptable true positive rate (TPR) without incurring a significant false positive rate (FPR) of 0.24. In other words, no rejection threshold can be chosen to reject a high proportion of false claims. Meanwhile, DiNCo enjoys better granularity, ranking positives above negatives even among instances with a verbalized confidence of 1.
  • Figure 2: Normalizing verbalized confidence with DiNCo. (Left) The LLM generates a claim along with several distractors and reports its confidences on them independently. To calibrate the main claim's confidence, we divide it by $\beta$, the sum over each distractor's confidence, weighted by uniqueness (center) and counterfactuality (right). Details are in the method section.
  • Figure 3: Distributions of total confidence over correct and incorrect answers.
  • Figure 4: Scaling self-consistency does not close the gap with DiNCo.
  • Figure 5: Saturation analysis (higher $\Delta$ = lower saturation). DiNCo alleviates saturation.
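
Two of the calibration metrics named in the Figure 1 caption, Expected Calibration Error and the Brier score, can be sketched as follows. This is an illustrative sketch rather than the paper's evaluation code: the function names are hypothetical, and the use of 10 equal-width bins for ECE is an assumption, since the binning scheme is not stated here.

```python
from typing import List, Sequence, Tuple


def brier_score(confs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between confidence and 0/1 correctness."""
    if not confs or len(confs) != len(labels):
        raise ValueError("confs and labels must be non-empty and equal length")
    return sum((c - y) ** 2 for c, y in zip(confs, labels)) / len(confs)


def expected_calibration_error(
    confs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width binned ECE (assumed binning scheme).

    Each bin contributes |accuracy - mean confidence|, weighted by the
    fraction of instances that fall in the bin.
    """
    if not confs or len(confs) != len(labels):
        raise ValueError("confs and labels must be non-empty and equal length")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confs, labels):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / len(confs)) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Saturated confidences: every claim gets 1.0, but only half are correct.
    saturated = [1.0, 1.0, 1.0, 1.0]
    correct = [1, 1, 0, 0]
    print(expected_calibration_error(saturated, correct))
    print(brier_score(saturated, correct))
```

The saturated toy input shows why the left panel of Figure 1 is a problem: when every confidence is 1.0, both metrics are driven entirely by the error rate, and no threshold can separate correct from incorrect claims.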