HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation

Wen Luo, Tianshu Shen, Wei Li, Guangyue Peng, Richeng Xuan, Houfeng Wang, Xi Yang

TL;DR

HalluDial introduces the first large-scale benchmark for dialogue-level hallucination evaluation, addressing gaps in prior work by capturing both factuality and faithfulness in spontaneous and induced information-seeking dialogues. It provides 146,856 samples across 4,094 dialogues, with detailed labels for detection, localization, and rationale, enabling comprehensive meta-evaluation of LLM hallucination evaluation capabilities. The dataset underpins a specialized judge model, HalluJudge, which achieves superior or competitive performance and generalizes to out-of-domain settings, facilitating automatic assessment of dialogue-level hallucinations. The work also analyzes how factors like temperature influence hallucination rates, and it offers a practical platform (dataset and code) for researchers to study and mitigate hallucinations in real-world, knowledge-grounded dialogue systems.

Abstract

Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), achieving remarkable performance across diverse tasks and enabling widespread real-world applications. However, LLMs are prone to hallucination, generating content that either conflicts with established knowledge or is unfaithful to the original sources. Existing hallucination benchmarks primarily focus on sentence- or passage-level hallucination detection, neglecting dialogue-level evaluation, hallucination localization, and rationale provision. They also predominantly target factuality hallucinations while underestimating faithfulness hallucinations, often relying on labor-intensive or non-specialized evaluators. To address these limitations, we propose HalluDial, the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial encompasses both spontaneous and induced hallucination scenarios, covering factuality and faithfulness hallucinations. The benchmark includes 4,094 dialogues with a total of 146,856 samples. Leveraging HalluDial, we conduct a comprehensive meta-evaluation of LLMs' hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge language model, HalluJudge. The high data quality of HalluDial enables HalluJudge to achieve superior or competitive performance in hallucination evaluation, facilitating the automatic assessment of dialogue-level hallucinations in LLMs and providing valuable insights into this phenomenon. The dataset and the code are available at https://github.com/FlagOpen/HalluDial.

Paper Structure

This paper contains 40 sections, 8 figures, and 18 tables.

Figures (8)

  • Figure 1: Samples from the HalluDial dataset, including knowledge, dialogue context, and hallucination evaluation results of the current response. Each evaluation result comprises hallucination detection, localization, and the corresponding rationale.
  • Figure 2: Topic distributions in the HalluDial dataset. The samples are categorized into seven topics, with the red circles highlighting the top topics.
  • Figure 3: Impact of temperature on hallucination rate. Left: turn level. Right: dialogue level.
  • Figure 4: Topic distributions of instances where LLMs fail to detect hallucinations. (a): GPT-4o-2024-05-13. (b): GPT-4-0125-preview. (c): HalluJudge. (d): GPT-3.5-turbo. (e): Llama-2-70B-chat. (f): vicuna-33B-v1.3.
  • Figure 5: Topic distributions of instances where LLMs are prone to hallucinations. (a): GPT-4o-2024-05-13. (b): GPT-4-0125-preview. (c): GPT-4-1106-preview. (d): GPT-3.5-turbo. (e): Llama-2-70B-chat. (f): vicuna-33B-v1.3.
  • ...and 3 more figures