A Survey of Automatic Hallucination Evaluation on Natural Language Generation

Siya Qi; Lin Gui; Yulan He; Zheng Yuan

TL;DR

This survey systematically maps Automatic Hallucination Evaluation (AHE) for Natural Language Generation, clarifying the distinction between Source Faithfulness ($SF$) and World Factuality ($WF$) and charting advances from pre-LLM to post-LLM eras. It catalogues foundational datasets, task-specific and general benchmarks, meta-evaluation efforts, and automated data generation, then organizes methodologies into reference-based, reference-free, and LLM-based paradigms, with domain-specific extensions. The work highlights fundamental limitations, practical deployment challenges, and directions toward interpretability, efficiency, and controllable SF/WF, offering a roadmap for robust, real-world AHE systems. Overall, it provides a comprehensive, up-to-date framework to guide future benchmark construction, metric development, and evaluation tooling in diverse NLP applications.

Abstract

The rapid advancement of Large Language Models (LLMs) has brought a pressing challenge: how to reliably assess hallucinations to guarantee model trustworthiness. Although Automatic Hallucination Evaluation (AHE) has become an indispensable component of this effort, the field remains fragmented in its methodologies, limiting both conceptual clarity and practical progress. This survey addresses this critical gap through a systematic analysis of 105 evaluation methods, revealing that 77.1% specifically target LLMs, a paradigm shift that demands new evaluation frameworks. We formulate a structured framework to organize the field, based on a survey of foundational datasets and benchmarks and a taxonomy of evaluation methodologies, which together systematically document the evolution from pre-LLM to post-LLM approaches. Beyond taxonomical organization, we identify fundamental limitations in current approaches and their implications for real-world deployment. To guide future research, we delineate key challenges and propose strategic directions, including enhanced interpretability mechanisms and integration of application-specific evaluation criteria, ultimately providing a roadmap for developing more robust and practical hallucination evaluation systems.

Paper Structure (64 sections, 4 figures, 6 tables)

Introduction
Scope of the Survey
Compare with Existing Surveys
Structure of the Survey
Foundations of AHE: Datasets and Benchmarks
Task-specific Benchmarks
General Factuality Benchmarks
Knowledge-grounded QA
Probing Model Truthfulness
Fresh Facts
Fact Reasoning
Benchmarks for Application
Long Context/Generation
Non-English Languages
High-Stakes Domains
...and 49 more sections

Figures (4)

Figure 1: Source Faithful Error (SFE) and World Factual Error (WFE) examples. The correct album is "1989", but the source document contains incorrect information. If the generated text says "1988", it is SF but has WFE. If it corrects this to "1989", it is WF but has SFE. When the text exhibits both SFE and WFE, it often includes non-factual content that does not come from the source, e.g., the incorrect statement that Travis Kelce is serving the Cincinnati Bearcats football team. If no such errors are present, the text is both SF and WF.
Figure 2: Taxonomy of AHE methods (shaded nodes), organized by the distinct techniques employed at each stage of the pipeline.
Figure 3: Distribution of tasks and their corresponding methods before and after the LLM era.
Figure 4: PRISMA flow diagram.
