Real-Time Detection of Hallucinated Entities in Long-Form Generation

Oscar Obeso, Andy Arditi, Javier Ferrando, Joshua Freeman, Cameron Holmes, Neel Nanda

TL;DR

This work proposes a streaming, token-level approach to detecting entity-level hallucinations in long-form LLM outputs by labeling entity spans and training lightweight probes. It uses a web-augmented frontier LLM to create token-level annotated data (LongFact++), enabling accurate, real-time detection with simple linear or LoRA-enhanced probes. The method outperforms uncertainty-based baselines on long-form and short-form tasks and generalizes across models, including cross-model transfer and even math reasoning tasks. Additionally, KL regularization provides a practical trade-off between detection performance and preserving original model behavior, and the framework enables selective answering to improve reliability in high-stakes settings. The authors acknowledge annotation noise and limited recall at realistic false-positive rates, and outline directions for future deployment and for extension beyond entity-level hallucinations.

Abstract

Large language models are now routinely used in high-stakes applications where hallucinations can cause serious harm, such as medical consultations or legal advice. Existing hallucination detection methods, however, are impractical for real-world use, as they are either limited to short factual queries or require costly external verification. We present a cheap, scalable method for real-time identification of hallucinated tokens in long-form generations, and scale it effectively to 70B parameter models. Our approach targets entity-level hallucinations (e.g., fabricated names, dates, citations) rather than claim-level hallucinations, thereby naturally mapping to token-level labels and enabling streaming detection. We develop an annotation methodology that leverages web search to annotate model responses with grounded labels indicating which tokens correspond to fabricated entities. This dataset enables us to train effective hallucination classifiers with simple and efficient methods such as linear probes. Evaluating across four model families, our classifiers consistently outperform baselines on long-form responses, including more expensive methods such as semantic entropy (e.g., AUC 0.90 vs 0.71 for Llama-3.3-70B), and also improve on baselines in short-form question-answering settings. Moreover, despite being trained only with entity-level labels, our probes effectively detect incorrect answers in mathematical reasoning tasks, indicating generalization beyond entities. While our annotation methodology is expensive, we find that annotated responses from one model can be used to train effective classifiers on other models; accordingly, we publicly release our datasets to facilitate reuse. Overall, our work suggests a promising new approach for scalable, real-world hallucination detection.
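
To make the idea of a token-level linear probe concrete, the following is a minimal sketch, not the authors' implementation. It assumes per-token hidden states are already available as NumPy arrays and uses synthetic data in place of real model activations; the function names (`train_linear_probe`, `stream_scores`), the learning rate, and the step count are all hypothetical choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_linear_probe(hidden, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe on per-token hidden states.

    hidden: (num_tokens, d_model) activations (synthetic stand-in here).
    labels: (num_tokens,) floats, 1.0 = token is part of a hallucinated entity.
    """
    n, d = hidden.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(hidden @ w + b)
        grad = p - labels                 # per-token gradient of BCE w.r.t. the logit
        w -= lr * (hidden.T @ grad) / n   # average over tokens
        b -= lr * grad.mean()
    return w, b

def stream_scores(hidden_iter, w, b):
    """Yield one hallucination probability per token as tokens arrive."""
    for h in hidden_iter:
        yield float(sigmoid(h @ w + b))

# Synthetic demo: labels shift activations along one random direction.
rng = np.random.default_rng(0)
d_model = 16
direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=400).astype(float)
hidden = rng.normal(size=(400, d_model)) + np.outer(2 * labels - 1, direction)

w, b = train_linear_probe(hidden, labels)
scores = list(stream_scores(hidden[:5], w, b))
```

Because the probe is a single dot product per token, scoring adds very little cost on top of generation, which is what makes a streaming setup plausible. In practice the activations would come from a chosen layer of the target model rather than from synthetic data.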

Paper Structure

This paper contains 73 sections, 9 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: Token-level probes detect hallucinated entities. In long-form generation settings (LongFact, HealthBench), linear probes far outperform uncertainty-based baselines, with LoRA probes improving performance even further. Our probes also perform well in short-form settings (TriviaQA), and out-of-distribution reasoning domains (MATH). Results for Llama-3.3-70B are displayed.
  • Figure 2: An annotated example of hallucination detection in long-form legal text. The underlines indicate entity spans labeled by our annotation pipeline: green denotes entities labeled as supported, while red denotes entities labeled as hallucinated. Hallucination detection probe scores for each token are shown as highlights, with the intensity reflecting the score's magnitude (scores below 0.4 are not shown). Note that while the annotation pipeline predominantly identifies and labels entities (e.g., "Jonathan M. Madero", "BlackBerry Curve 8330"), this example illustrates the difficulty in cleanly separating entities from non-entities in long-form text. We notice that both our annotation pipeline and our resulting probes sometimes detect broader hallucinations, such as claims, even if they don't correspond cleanly to an entity (e.g., "at his residence in San Diego," which is indeed a fabricated detail).
  • Figure 3: Token-level annotation pipeline. We construct LongFact++, a large set of prompts spanning diverse domains to elicit entity-dense generations. The target LLM (e.g., Llama) produces long-form completions containing both factual and hallucinated content. A frontier LLM with web search (e.g., Claude) then identifies entities within each generation, verifies them against external sources, and produces labels indicating which entities are supported and which are not. The result is a dataset in which every token is annotated to indicate whether it forms part of a hallucination.
  • Figure 4: Generalization between short- and long-form generation settings (Llama-3.1-8B; 3 seeds per point; mean $\pm$ standard deviation AUC shown). The x-axis refers to the number of training examples from the regime indicated in the legend. Left: Performance on the short-form (TriviaQA) test set. Blue: probes trained only on short-form. Red: probes trained only on long-form. Right: Performance on the long-form (LongFact) test set. Performance gaps between training regimes are smaller on short-form tests (${<}$0.05 AUC) but much larger on long-form tests (${\sim}$0.10 AUC).
  • Figure 5: Hallucination probes exhibit strong cross-model generalization. Left: Cross-model generalization across all five models. The y-axis is the detector model (where the probe was trained), and the x-axis is the test data model (whose generations are evaluated). Right: Cross-model training-testing comparison for the Mistral-Small-24B probe. The y-axis indicates which model's generations were used as the training data source, and the x-axis indicates which model's generations were used as the test data source.
  • ...and 12 more figures
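
The Figure 3 caption describes turning entity-level annotations into per-token labels. One way this mapping might look is sketched below; it is an illustrative assumption, not the paper's pipeline. The whitespace tokenizer stands in for a real model tokenizer, the helper names (`whitespace_offsets`, `spans_to_token_labels`) are hypothetical, and the example sentence and its "hallucinated" span are made up.

```python
def whitespace_offsets(text):
    """Toy tokenizer: return (start, end) character offsets of each word."""
    offsets, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        offsets.append((start, end))
        pos = end
    return offsets

def spans_to_token_labels(token_offsets, hallucinated_spans):
    """Label a token 1 if it overlaps any hallucinated character span, else 0."""
    labels = []
    for tok_start, tok_end in token_offsets:
        overlaps = any(
            tok_start < span_end and span_start < tok_end
            for span_start, span_end in hallucinated_spans
        )
        labels.append(1 if overlaps else 0)
    return labels

# Made-up example: pretend the annotator flagged the name as unsupported.
text = "The case was argued by Jane Roe in 1987."
entity = "Jane Roe"
span_start = text.find(entity)
hallucinated_spans = [(span_start, span_start + len(entity))]

token_offsets = whitespace_offsets(text)
token_labels = spans_to_token_labels(token_offsets, hallucinated_spans)
```

Working in character offsets keeps the annotator independent of any particular tokenizer, so the same span annotations could be re-projected onto a different model's tokens.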