HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation

Wen Luo; Tianshu Shen; Wei Li; Guangyue Peng; Richeng Xuan; Houfeng Wang; Xi Yang

TL;DR

HalluDial introduces the first large-scale benchmark for dialogue-level hallucination evaluation, addressing gaps in prior work by capturing both factuality and faithfulness in spontaneous and induced information-seeking dialogues. It provides 146,856 samples across 4,094 dialogues, with detailed labels for detection, localization, and rationale, enabling comprehensive meta-evaluation of LLM hallucination evaluation capabilities. The dataset underpins a specialized judge model, HalluJudge, which achieves superior or competitive performance and generalizes to out-of-domain settings, facilitating automatic assessment of dialogue-level hallucinations. The work also analyzes how factors like temperature influence hallucination rates, and it offers a practical platform (dataset and code) for researchers to study and mitigate hallucinations in real-world, knowledge-grounded dialogue systems.

Abstract

Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), achieving remarkable performance across diverse tasks and enabling widespread real-world applications. However, LLMs are prone to hallucination, generating content that either conflicts with established knowledge or is unfaithful to the original sources. Existing hallucination benchmarks primarily focus on sentence- or passage-level hallucination detection, neglecting dialogue-level evaluation, hallucination localization, and rationale provision. They also predominantly target factuality hallucinations while underestimating faithfulness hallucinations, often relying on labor-intensive or non-specialized evaluators. To address these limitations, we propose HalluDial, the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial encompasses both spontaneous and induced hallucination scenarios, covering factuality and faithfulness hallucinations. The benchmark includes 4,094 dialogues with a total of 146,856 samples. Leveraging HalluDial, we conduct a comprehensive meta-evaluation of LLMs' hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge language model, HalluJudge. The high data quality of HalluDial enables HalluJudge to achieve superior or competitive performance in hallucination evaluation, facilitating the automatic assessment of dialogue-level hallucinations in LLMs and providing valuable insights into this phenomenon. The dataset and the code are available at https://github.com/FlagOpen/HalluDial.
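
To make the annotation structure concrete, the following is a minimal sketch of how one sample might be represented, assuming each sample pairs the knowledge, dialogue context, and current response with a detection label, a localized span, and a rationale. The class name, field names, and the consistency check are hypothetical illustrations and are not taken from the released dataset schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HalluDialSample:
    """Hypothetical record layout; field names are illustrative only."""
    knowledge: str                     # grounding knowledge for the dialogue
    dialogue_context: List[str]        # earlier turns of the conversation
    response: str                      # current response under evaluation
    is_hallucinated: bool              # detection label
    hallucinated_span: Optional[str]   # localization (None if no hallucination)
    rationale: str                     # explanation supporting the judgment


def is_consistent(sample: HalluDialSample) -> bool:
    """Assumed sanity check: a hallucinated sample should carry a span
    that occurs in the response, and a clean sample should carry none."""
    if sample.is_hallucinated:
        return bool(sample.hallucinated_span) and sample.hallucinated_span in sample.response
    return sample.hallucinated_span is None


example = HalluDialSample(
    knowledge="The Eiffel Tower was completed in 1889.",
    dialogue_context=["When was the Eiffel Tower finished?"],
    response="It was completed in 1901.",
    is_hallucinated=True,
    hallucinated_span="1901",
    rationale="The knowledge states 1889, not 1901.",
)
```

A record like this keeps the three evaluation outputs (detection, localization, rationale) alongside the evidence they refer to, so a judge model's output can be compared field by field.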

Paper Structure (40 sections, 8 figures, 18 tables)

Introduction
The HalluDial Benchmark
Spontaneous Hallucination Scenario
Diverse Dialogue Sampling
Automatic Hallucination Annotation
Induced Hallucination Scenario
Implementation Details
Benchmark Statistics and Usage
Evaluating Hallucination Detection, Localization and Explanation Capabilities
Hallucination Detection
Hallucination Localization and Explanation
Generalizability of HalluJudge
Evaluating the Hallucinations of LLMs in Information-Seeking Dialogues
Main Results
Impact of Temperature on Hallucinations
...and 25 more sections

Figures (8)

Figure 1: Samples from the HalluDial dataset, including knowledge, dialogue context, and hallucination evaluation results of the current response. Each evaluation result comprises hallucination detection, localization, and the corresponding rationale.
Figure 2: Topic distributions in HalluDial dataset. The samples are categorized into seven topics, with the red circles highlighting the top topics.
Figure 3: Impact of temperature on hallucination rate. Left: turn level. Right: dialogue level.
Figure 4: Topic distributions of instances where LLMs fail to detect hallucinations. (a): GPT-4o-2024-05-13. (b): GPT-4-0125-preview. (c): HalluJudge. (d): GPT-3.5-turbo. (e): Llama-2-70B-chat. (f): vicuna-33B-v1.3.
Figure 5: Topic distributions of instances where LLMs are prone to hallucinations. (a): GPT-4o-2024-05-13. (b): GPT-4-0125-preview. (c): GPT-4-1106-preview. (d): GPT-3.5-turbo. (e): Llama-2-70B-chat. (f): vicuna-33B-v1.3.
...and 3 more figures
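
The distinction between turn-level and dialogue-level hallucination rates in Figure 3 can be illustrated with a short sketch. It assumes that a dialogue counts as hallucinated if at least one of its turns is flagged; this aggregation rule, the function names, and the toy data are assumptions made for illustration, not the paper's stated definitions.

```python
from typing import Dict, List


def turn_level_rate(dialogues: Dict[str, List[bool]]) -> float:
    """Fraction of all turns flagged as hallucinated."""
    flags = [flag for turns in dialogues.values() for flag in turns]
    return sum(flags) / len(flags) if flags else 0.0


def dialogue_level_rate(dialogues: Dict[str, List[bool]]) -> float:
    """Fraction of dialogues with at least one flagged turn
    (assumed aggregation rule)."""
    if not dialogues:
        return 0.0
    return sum(any(turns) for turns in dialogues.values()) / len(dialogues)


# Toy data: per-dialogue lists of per-turn hallucination flags.
toy = {
    "d1": [False, True, False],
    "d2": [False, False],
}
print(turn_level_rate(toy))      # 1 flagged turn out of 5
print(dialogue_level_rate(toy))  # 1 flagged dialogue out of 2
```

Under this assumed rule, the dialogue-level rate is never lower than the turn-level rate, since a single flagged turn is enough to mark the whole dialogue.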
