HalluLens: LLM Hallucination Benchmark

Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, Pascale Fung

TL;DR

HalluLens addresses the fragmentation in LLM hallucination evaluation by formalizing a taxonomy that separates hallucination from factuality and by introducing a dynamic, extrinsic-hallucination benchmark suite. It presents three tasks (PreciseWikiQA, LongWiki, NonExistentRefusal) to probe model consistency with training data and input contexts, alongside three intrinsic benchmarks (HHEM, ANAH 2.0, FaithEval) to assess faithfulness relative to provided contexts. The benchmark emphasizes dynamic test-set generation to mitigate leakage and provides a thorough comparison with existing factuality benchmarks, highlighting when and how existing tests can be repurposed or revised for hallucination evaluation. Overall, HalluLens offers a unified, extensible framework for measuring extrinsic and intrinsic hallucinations, aiming to guide more reliable model development and evaluation in real-world applications.

Abstract

Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination." These hallucinations undermine user trust and hinder the adoption of generative AI systems. Addressing hallucinations is essential for the advancement of LLMs. This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks, built upon a clear taxonomy of hallucination. A major challenge in benchmarking hallucinations is the lack of a unified framework due to inconsistent definitions and categorizations. We disentangle LLM hallucination from "factuality," proposing a clear taxonomy that distinguishes between extrinsic and intrinsic hallucinations, to promote consistency and facilitate research. Extrinsic hallucinations, where the generated content is not consistent with the training data, are increasingly important as LLMs evolve. Our benchmark includes dynamic test set generation to mitigate data leakage and ensure robustness against such leakage. We also analyze existing benchmarks, highlighting their limitations and saturation. The work aims to: (1) establish a clear taxonomy of hallucinations, (2) introduce new extrinsic hallucination tasks, with data that can be dynamically regenerated to prevent saturation by leakage, (3) provide a comprehensive analysis of existing benchmarks, distinguishing them from factuality evaluations.

Paper Structure

This paper contains 52 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: HalluLens: LLM Hallucination Benchmark. It consists of newly introduced extrinsic hallucination tasks and existing intrinsic hallucination tasks. Extrinsic hallucination test sets are dynamically generated.
  • Figure 2: Hallucination categories and factuality in LLMs: This diagram shows the two categories of hallucination, extrinsic and intrinsic, in the blue circles, excluding “factuality” benchmarks. Existing categorizations by zhang2023siren and huang2023survey conflate hallucination with factuality and overlook extrinsic hallucination. Tasks in blue are new benchmarks in HalluLens, while the red ones conflate extrinsic hallucination with factuality. The red tasks can be adapted to extrinsic hallucination evaluation with metric modifications. The black benchmarks are suitable for intrinsic hallucinations.
  • Figure 3: Examples for each challenge, including extrinsic hallucination, intrinsic hallucination and factuality issues. Note that LLM factuality is not a type of hallucination, yet it is closely tied to the hallucination problem in LLMs. $^*$As of December 2024.
  • Figure 4: Analysis of LLM performance on questions of varying difficulty. Difficulty is assessed using the harmonic centrality score of Wikipedia pages, which we divided into 10 groups (labeled 0 to 9, hardest to easiest). Full graph available in the appendix.
  • Figure 5: Results for PreciseWikiQA. The error bar shows standard deviation from three runs of evaluation.
  • ...and 6 more figures
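
The Figure 4 caption groups Wikipedia pages by harmonic centrality. As a rough illustration only, the sketch below computes harmonic centrality (the sum of reciprocal shortest-path distances to every other reachable node) on a toy undirected link graph and splits the pages into equally sized groups. The graph, the function names `harmonic_centrality` and `assign_groups`, the equal-size split, and the mapping of low centrality to group 0 (hardest) are all assumptions made for this example; the summary above does not specify how the paper builds its graph or draws group boundaries.

```python
from collections import deque


def harmonic_centrality(graph):
    """Return {node: sum of 1/d(node, u) over every other reachable node u}.

    `graph` is an adjacency dict for an undirected, unweighted graph
    (a hypothetical stand-in for a page-link graph).
    """
    scores = {}
    for source in graph:
        # Breadth-first search gives shortest-path distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        # Unreachable nodes contribute nothing (1/infinity = 0).
        scores[source] = sum(1.0 / d for n, d in dist.items() if n != source)
    return scores


def assign_groups(scores, n_groups=10):
    """Rank nodes by ascending centrality and split them into equal-size groups.

    Assumption: group 0 holds the least central pages (treated as hardest),
    and the last group holds the most central pages (treated as easiest).
    """
    ranked = sorted(scores, key=lambda n: (scores[n], n))  # ties broken by name
    return {node: i * n_groups // len(ranked) for i, node in enumerate(ranked)}


# Toy graph: a hub page linked to three pages, one of which has a leaf page.
toy_graph = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "leaf"],
    "leaf": ["c"],
}

toy_scores = harmonic_centrality(toy_graph)
toy_groups = assign_groups(toy_scores, n_groups=2)
```

On the toy graph, `hub` is the most central page and `leaf` the least, so with two groups `leaf` falls in group 0 and `hub` in group 1. For a graph at the scale of Wikipedia, a dedicated graph library would be a more practical choice than this per-node breadth-first search.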