Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators

Matéo Mahaut; Laura Aina; Paula Czarnowska; Momchil Hardalov; Thomas Müller; Lluís Màrquez

TL;DR

This work tackles the challenge of reliably estimating LLM factual confidence by proposing a unified framework to compare estimators for both $P(T)$ (fact verification) and $P(IK)$ (question answering). It conducts a large empirical study across eight LLMs and five estimator families, finding that trained hidden-state probes deliver the most accurate confidence estimates when model weights are accessible, while word-prompted and surface-based methods lag, especially for non-instruction-tuned models. The authors also examine robustness by paraphrasing and translating inputs, revealing that LLMs can exhibit unstable confidence across meaning-preserving variations, and show partial cross-language generalization. The results have practical implications for deploying LLMs with uncertainty estimates and point to the need for reliable black-box estimators and for methods that improve the stability of parametric knowledge. The authors release their code for reproducibility.
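The figure captions listed further down report estimator quality as AUPRC. As a rough illustration of that metric, and not the authors' evaluation code, the sketch below computes the area under the precision-recall curve with the step-wise average-precision sum. The function name `average_precision` and the toy inputs are assumptions made for this example, and tied scores are not handled specially.

```python
def average_precision(scores, labels):
    """Approximate AUPRC as average precision.

    scores: confidence scores from some estimator (higher = more confident).
    labels: 1 if the statement is true / the model knows the answer, else 0.
    Hypothetical helper for illustration; ties are broken by sort order.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank examples from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            # Precision at the rank of each positive example.
            total += hits / rank
    return total / n_pos


if __name__ == "__main__":
    # Toy data: four statements, two of them true.
    print(average_precision([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

An estimator that ranks every true statement above every false one scores 1.0 under this metric, which is why it suits comparing estimators whose raw scores live on different scales.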

Abstract

Large Language Models (LLMs) tend to be unreliable in the factuality of their answers. To address this problem, NLP researchers have proposed a range of techniques to estimate LLMs' confidence over facts. However, due to the lack of a systematic comparison, it is not clear how the different methods compare to one another. To fill this gap, we present a survey and empirical comparison of estimators of factual confidence. We define an experimental framework allowing for fair comparison, covering both fact verification and question answering. Our experiments across a series of LLMs indicate that trained hidden-state probes provide the most reliable confidence estimates, albeit at the expense of requiring access to weights and training data. We also conduct a deeper assessment of factual confidence by measuring the consistency of model behavior under meaning-preserving variations in the input. We find that the confidence of LLMs is often unstable across semantically equivalent inputs, suggesting that there is much room for improvement of the stability of models' parametric knowledge. Our code is available at https://github.com/amazon-science/factual-confidence-of-llms.
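One way to picture the consistency measurement described above is to score several paraphrases of each fact and look at how much the scores spread. The sketch below is a minimal illustration under stated assumptions, not the released implementation: the function name `paraphrase_std`, the min-max normalization over all scores, and the use of population standard deviation are all choices made for this example.

```python
from statistics import pstdev


def paraphrase_std(scores_by_fact):
    """Per-fact standard deviation of normalized confidence scores.

    scores_by_fact: dict mapping a fact id to the list of confidence scores
    obtained on paraphrases of that fact.
    Assumption: scores are min-max normalized over the whole collection
    before the spread is computed; the paper may normalize differently.
    """
    all_scores = [s for scores in scores_by_fact.values() for s in scores]
    if not all_scores:
        raise ValueError("no scores provided")
    low, high = min(all_scores), max(all_scores)
    span = high - low

    result = {}
    for fact, scores in scores_by_fact.items():
        # Map every score into [0, 1]; a constant collection maps to 0.0.
        normalized = [(s - low) / span if span else 0.0 for s in scores]
        result[fact] = pstdev(normalized)
    return result


if __name__ == "__main__":
    # Toy data: fact "a" is scored inconsistently, fact "b" is stable.
    print(paraphrase_std({"a": [0.0, 1.0], "b": [0.5, 0.5]}))
```

A value near zero for a fact means the estimator gives nearly the same confidence to every paraphrase; larger values indicate the kind of instability the abstract reports.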

Paper Structure (36 sections, 7 figures, 7 tables)

Introduction
Factual Confidence: Key Concepts
Definition of a Fact
Factual Confidence
Robustness of Factual Knowledge
Factual Confidence: Survey of Methods
Trained Probes
Sequence Probability
Verbalization
Surrogate Token Probability
Output Consistency
Methodology
Data
$P(T)$ in Fact Verification: Lama T-REx
$P(IK)$ in QA: PopQA
...and 21 more sections

Figures (7)

Figure 1: Overview of our factual confidence estimation framework. We work with five groups of methods and two formulations: $P(\text{I know})$, which applies to questions, and $P(\text{True})$, which applies to statements. All of the methods produce a continuous score, except verbalization, where the model generates a confidence level.
Figure 2: AUPRC scores on T-REx with both true and false statements.
Figure 3: AUPRC scores on PopQA dataset.
Figure 4: Distribution of standard deviation scores for normalized $P(\text{T})$ on paraphrases of the same fact.
Figure 5: Variation in $P(\text{T})$ AUPRC when sampling paraphrases. 10 sets of paraphrases are randomly sampled, with one paraphrase for every question in Lama T-REx.
...and 2 more figures
