Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?

Gal Yona; Roee Aharoni; Mor Geva

TL;DR

This work defines faithful response uncertainty as the requirement that an LLM's expressed hedging aligns with its intrinsic confidence for each assertion in its answer. It introduces a formal faithfulness metric built from decisiveness and consistency-based confidence, and validates a judge-based implementation using a Gemini Ultra oracle. Through knowledge-intensive QA benchmarks (PopQA and Natural Questions), the study shows modern LLMs predominantly produce decisive, unhedged answers even when uncertain, with prompting only weakly improving faithfulness. The findings underscore a need for stronger alignment techniques to enable trustworthy uncertainty communication in practical AI systems.

Abstract

We posit that large language models (LLMs) should be capable of expressing their intrinsic uncertainty in natural language. For example, if the LLM is equally likely to output two contradicting answers to the same question, then its generated response should reflect this uncertainty by hedging its answer (e.g., "I'm not sure, but I think..."). We formalize faithful response uncertainty based on the gap between the model's intrinsic confidence in the assertions it makes and the decisiveness by which they are conveyed. This example-level metric reliably indicates whether the model reflects its uncertainty, as it penalizes both excessive and insufficient hedging. We evaluate a variety of aligned LLMs at faithfully communicating uncertainty on several knowledge-intensive question answering tasks. Our results provide strong evidence that modern LLMs are poor at faithfully conveying their uncertainty, and that better alignment is necessary to improve their trustworthiness.
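The gap-based metric described above can be illustrated with a small sketch. It assumes that each assertion in a response already has a decisiveness score and an intrinsic confidence score in [0, 1], and that the per-response score is one minus the mean absolute gap between the two; the function name `faithfulness` and its list-based signature are hypothetical and chosen only for illustration, not taken from the authors' code.

```python
from typing import Sequence


def faithfulness(decisiveness: Sequence[float], confidence: Sequence[float]) -> float:
    """Illustrative score: 1 minus the mean absolute gap between the
    decisiveness and the confidence of each assertion in one response.

    Assumption: both inputs are aligned per assertion and lie in [0, 1].
    """
    if len(decisiveness) != len(confidence):
        raise ValueError("need one confidence score per decisiveness score")
    if len(decisiveness) == 0:
        raise ValueError("a response must contain at least one assertion")
    # Absolute value penalizes both over-hedging and under-hedging.
    gaps = [abs(d - c) for d, c in zip(decisiveness, confidence)]
    return 1.0 - sum(gaps) / len(gaps)


# A fully decisive answer (1.0) from a model that is torn between two
# contradicting answers (confidence 0.5) is only half faithful.
decisive_but_unsure = faithfulness([1.0], [0.5])

# Hedging in proportion to the uncertainty closes the gap.
hedged_and_unsure = faithfulness([0.5], [0.5])

# Hedging while actually confident is penalized symmetrically.
hedged_but_sure = faithfulness([0.5], [1.0])
```

Under these assumptions the score is symmetric: a response that hedges too much loses as much as one that hedges too little by the same margin, which matches the abstract's statement that the metric penalizes both excessive and insufficient hedging.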

Paper Structure (31 sections, 3 equations, 5 figures, 6 tables)

Introduction
Faithful Response Uncertainty
Measuring Decisiveness & Uncertainty
Quantifying Decisiveness
Quantifying Uncertainty
Implementation Details
Correlation with Human Judgement
Experimental Setting
Data
Models
Methods
Evaluation
Results
Without special instructions, models generate decisive answers, even for uncertain answers
State-of-the-art models cannot be easily steered towards faithfully expressing uncertainty via prompting.
...and 16 more sections

Figures (5)

Figure 1: We define faithful response uncertainty based on the gap between the decisiveness (blue) of the response and the model's intrinsic confidence in it (hatched orange). We empirically show: (1) with standard decoding, models answer decisively even in the presence of uncertainty (top left); (2) when prompted to express uncertainty, generated hedges are not faithful to the model's intrinsic uncertainty (bottom left).
Figure 2: Our mean decisiveness score ($\star$) vs. IQR of human perceptions of probability (blue bars), obtained by Fagen-Ulmschneider (2023). The LLM-based outputs generally agree with the human judgements.
Figure 3: Standard decoding yields decisive answers, even under uncertainty: We show results for standard decoding on PopQA (left) and NQ (right). Models (x-axis) are sorted by Accuracy (blue), and the additional bars show Confidence (orange) and Decisiveness (green). We see: (1) More accurate models generally tend to have higher confidence. (2) Even the best models have significant uncertainty (e.g., on the challenging PopQA benchmark, the highest confidence is 0.8). (3) All the models answer decisively, regardless of their uncertainty.
Figure 4: Weak correlation between decisiveness and confidence: We plot decisiveness (y-axis) vs confidence (x-axis) for two of the best performing (model, method, dataset) combinations (see Table \ref{table:cmfg-results}). We see that these methods succeed at slightly improving $\mathrm{cMFG}$ (beyond the $0.5$ baseline) by inducing some non-decisive answers, but the correlation between decisiveness and confidence is weak.
Figure 5: Prompting models to express uncertainty can slightly reduce the mean decisiveness: We plot the mean decisiveness (y-axis) vs mean confidence (x-axis) for all the large models we tested (Gemini Pro and Gemini Ultra, and the two GPT variants). We see that only Uncertainty and Uncertainty+ are capable of inducing hedging expressions, thus reducing the mean decisiveness.

Theorems & Definitions (1)

definition 1: Faithful Response Uncertainty
