The Case for Repeatable, Open, and Expert-Grounded Hallucination Benchmarks in Large Language Models

Justin D. Norman, Michael U. Rivera, D. Alex Hughes

TL;DR

To address pervasive LLM hallucination, the authors advocate repeatable, open, and domain-contextualized benchmarks anchored in expert ground truth. They present a taxonomy of hallucinations and a case study showing that excluding domain experts during data creation yields metrics that lack validity and practical utility. A concrete workflow for dataset development is demonstrated in politics and medicine, comprising 64 questions, expert ground truth, LLM outputs, and a 12-metric evaluation suite. The paper argues that open, expert-grounded benchmarks enable reproducibility, domain relevance, and safer deployment, while highlighting challenges such as expert cost and disagreement that require community collaboration.

Abstract

Plausible, but inaccurate, tokens in model-generated text are widely believed to be pervasive and problematic for the responsible adoption of language models. Despite this concern, there is little scientific work that attempts to measure the prevalence of language model hallucination in a comprehensive way. In this paper, we argue that language models should be evaluated using repeatable, open, and domain-contextualized hallucination benchmarking. We present a taxonomy of hallucinations alongside a case study that demonstrates that when experts are absent from the early stages of data creation, the resulting hallucination metrics lack validity and practical utility.

Paper Structure

This paper contains 26 sections, 3 tables.
