HALoGEN: Fantastic LLM Hallucinations and Where to Find Them

Abhilasha Ravichander, Shrusti Ghela, David Wadden, Yejin Choi

TL;DR

HALoGEN introduces a large-scale, multi-domain hallucination benchmark for generative LLMs, comprising 10,923 prompts and ~150,000 generations from 14 models across nine tasks and two task modes (response-based and refusal-based). Each generation is decomposed into atomic units by task-specific engines and verified against high-quality sources, enabling precise categorization of hallucinations (Type A/B/C) and attribution to training data. The framework yields three evaluation metrics—Hallucination Score, Response Ratio, and Utility Score—and reveals substantial model hallucinations even among top performers, with domain-dependent patterns and differing behavior between open-source and closed models. The work provides a principled dataset and methodology for analyzing, attributing, and mitigating hallucinations, highlighting the need for multi-domain evaluation and retrieval-based or uncertainty-aware strategies to build more truthful AI systems.

Abstract

Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ~150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), or incorrect knowledge in training data (Type B errors), or are fabrication (Type C errors). We hope our framework provides a foundation to enable the principled study of why generative models hallucinate, and advances the development of trustworthy large language models.
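
The decompose-then-verify loop described above can be illustrated for the code domain, where the atomic units are package imports. This is a minimal sketch, not the authors' verifier: `KNOWN_PACKAGES` is a hypothetical stand-in for a real package index, the function names `extract_imports` and `hallucination_score` are invented for illustration, and the score shown (the fraction of atomic units that fail verification) is an assumed reading of the Hallucination Score.

```python
import ast

# Hypothetical stand-in for a high-quality knowledge source (e.g. a package index).
KNOWN_PACKAGES = {"numpy", "pandas", "requests", "os", "sys", "json"}


def extract_imports(source: str) -> list[str]:
    """Decompose generated code into atomic units: top-level package names."""
    units = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            units.extend(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Skip relative imports (level > 0); they do not name a package.
            units.append(node.module.split(".")[0])
    return sorted(set(units))


def hallucination_score(units, known=KNOWN_PACKAGES):
    """Fraction of atomic units not supported by the knowledge source (assumed definition)."""
    if not units:
        return 0.0
    unsupported = [u for u in units if u not in known]
    return len(unsupported) / len(units)


# Made-up model generation; "fastjsonx" is an invented package name.
generation = "import numpy as np\nimport fastjsonx\nfrom pandas import DataFrame\n"
units = extract_imports(generation)
score = hallucination_score(units)
```

In this toy case one of the three extracted packages is absent from the reference set, so the sketch assigns a score of one third. A real verifier would query a much larger knowledge source and would need a policy for generations that fail to parse.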

Paper Structure (83 sections, 4 equations, 4 figures, 6 tables)


Figures (4)

  • Figure 1: Hallucination evaluation for code and citation generation, two of nine evaluation settings in HALoGEN. Given an input prompt, we decompose each model response by identifying verifiable atomic units: package imports and paper citations, respectively. Then, we verify each unit to determine whether the unit is factual or hallucinated. Finally, we classify hallucinated facts into one of three categories based on relationship to training data (see the introduction).
  • Figure 2: Spearman correlation of model rankings across datasets. We observe that model hallucinations can vary considerably by domain, highlighting the need for a diverse benchmark to study hallucination patterns.
  • Figure 3: Counts of model hallucination types when describing hypothetical historical events. Models seldom make Type B errors, indicating that these hallucinations are unlikely to have a basis in pretraining data.
  • Figure 4: (a): Counts of types of model hallucinations on educational affiliations of senators. Models often hallucinate despite evidence of the correct fact within pretraining corpora. (b): Distribution of hallucination types in model generations for a content-grounded task: abstractive summarization. The vast majority of model hallucinations do not stem from the introduction of an external fact.
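
The rank comparison behind Figure 2 can be sketched as follows. This is an illustrative example under stated assumptions, not the paper's analysis code: the scores are made-up numbers for four hypothetical models, the helper names `ranks` and `spearman` are invented, and the implementation uses the textbook no-ties formula for Spearman's rho.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman's rho via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Made-up hallucination scores for four hypothetical models on two domains.
domain_a = [0.10, 0.25, 0.40, 0.55]
domain_b = [0.30, 0.20, 0.60, 0.50]
rho = spearman(domain_a, domain_b)
```

A value of rho near 1 would mean the two domains rank the models almost identically, while lower values mean a model that hallucinates little in one domain may rank poorly in another. With tied scores, a tie-aware routine such as `scipy.stats.spearmanr` would be the safer choice.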