Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification

Ekaterina Fadeeva; Aleksandr Rubashevskii; Artem Shelmanov; Sergey Petrakov; Haonan Li; Hamdy Mubarak; Evgenii Tsymbalov; Gleb Kuzmin; Alexander Panchenko; Timothy Baldwin; Preslav Nakov; Maxim Panov

TL;DR

This work tackles factual inaccuracies in large language model outputs with a fact-checking pipeline based on token-level uncertainty quantification. It introduces Claim-Conditioned Probability (CCP), a method that isolates the uncertainty about the value of a claim and uses NLI-based comparison of the top-K token alternatives to produce claim-level scores. In biography-generation experiments across seven LLMs and four languages, CCP shows strong improvements over the baselines, and human evaluation finds the pipeline competitive with a fact-checking tool that relies on external knowledge (FactScore). The approach is computationally efficient and runs entirely on the LLM's own outputs, which makes it practical for flagging unreliable claims in AI-assisted text generation.

Abstract

Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factually correct, making it extremely hard for users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about what claim to generate at the current step and what surface form to use. Our method, Claim-Conditioned Probability (CCP), measures only the uncertainty of a particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for seven LLMs and four languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge.

Paper Structure (34 sections, 10 equations, 12 figures, 11 tables)

Introduction
Related Work
Fact-Checking LLM Generations and Detecting Hallucinations
Uncertainty Quantification of LLM Generations
Fact-Checking Pipeline
Uncertainty Quantification
Claim-Level UQ Baselines
Claim-Conditioned Probability
Motivation and Theoretical Background
Implementation
Benchmark for Evaluation of Claim-Level UQ Methods
Experiments
Experimental Setup
Results for English on the FactScore Annotation
Multilingual Results on Manual Annotation
...and 19 more sections

Figures (12)

Figure 1: Visual comparison of our Claim-Conditioned Probability method to the Maximum Probability baseline. CCP accurately identifies the incorrectly specified number of awards (in red), whereas Maximum Probability erroneously highlights the claim that is actually correct.
Figure 2: Example of CCP calculation for the word painting in a Vicuna 13b generation.
Figure 3: ROC-AUC of claim-level UQ methods based on FactScore labels, aggregated into bins when considering only facts from the first $2$, $5$, and all sentences (English).
Figure 4: The scheme of the fact-checking pipeline based on UQ.
Figure 5: Part of the Vicuna 13b generation process and the corresponding CCP calculation. The words of the greedily generated sentence are shown in sequence at the top; each non-functional word is accompanied by its alternatives and their autoregressive generation probabilities. Words with probability less than 0.1% are omitted. Green words entail the greedily generated word, red words contradict it, and yellow words fall into the neutral NLI class. At the last position, CCP successfully distinguishes Norway from the year-related words and does not consider its probability in the final formula.
...and 7 more figures
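
The calculation described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes that the word-level score is the probability mass of the alternatives that entail the greedily generated word, divided by the mass of the alternatives that either entail or contradict it, with neutral alternatives left out. It also assumes that word scores are combined into a claim-level uncertainty as one minus their product. Both choices are assumptions made for illustration, not formulas quoted from this page, and the names ccp_word, ccp_claim, toy_nli, and the toy probabilities are hypothetical.

```python
from typing import Callable, List, Tuple

# NLI classes named in the Figure 5 caption.
ENTAIL, CONTRADICT, NEUTRAL = "entail", "contradict", "neutral"


def ccp_word(
    greedy_word: str,
    alternatives: List[Tuple[str, float]],
    nli: Callable[[str, str], str],
) -> float:
    """Word-level score: entailing mass / (entailing + contradicting mass).

    `alternatives` holds (word, probability) pairs for the top-K candidates
    at this position and should include the greedy word itself.
    """
    entail_mass = 0.0
    decided_mass = 0.0  # mass of alternatives that are not neutral
    for word, prob in alternatives:
        # The greedy word trivially entails itself, so skip the NLI call.
        label = ENTAIL if word == greedy_word else nli(word, greedy_word)
        if label == ENTAIL:
            entail_mass += prob
            decided_mass += prob
        elif label == CONTRADICT:
            decided_mass += prob
        # Neutral alternatives are not considered at all.
    if decided_mass == 0.0:
        raise ValueError("no entailing or contradicting alternatives")
    return entail_mass / decided_mass


def ccp_claim(word_scores: List[float]) -> float:
    """Claim-level uncertainty (assumed aggregation): 1 - product of scores."""
    confidence = 1.0
    for score in word_scores:
        confidence *= score
    return 1.0 - confidence


# Toy stand-in for an NLI model: a fixed lookup table (hypothetical labels).
TOY_LABELS = {"paintings": ENTAIL, "sculpture": CONTRADICT, "the": NEUTRAL}


def toy_nli(alternative: str, greedy_word: str) -> str:
    return TOY_LABELS.get(alternative, NEUTRAL)


# Made-up top-K candidates and probabilities for a single position.
toy_alternatives = [
    ("painting", 0.5),
    ("paintings", 0.2),
    ("sculpture", 0.2),
    ("the", 0.1),
]

word_score = ccp_word("painting", toy_alternatives, toy_nli)
print(word_score, ccp_claim([word_score, 1.0]))
```

In this toy example, the neutral candidate "the" does not affect the score, whereas the contradicting candidate "sculpture" lowers it. In practice, the lookup table would be replaced by a real NLI classifier, and the probabilities would come from the top-K token distribution of the LLM.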
