DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine

Jean Seo, Jongwon Lim, Dongjun Jang, Hyopil Shin

TL;DR

DAHL is a benchmark dataset and automated evaluation system for assessing hallucination in long-form text generation in the biomedical domain. It holds potential as an efficient alternative to human-annotated preference labels and can be extended to other specialized domains.

Abstract

We introduce DAHL, a benchmark dataset and automated evaluation system designed to assess hallucination in long-form text generation, specifically within the biomedical domain. Our benchmark dataset, meticulously curated from biomedical research papers, consists of 8,573 questions across 29 categories. DAHL evaluates fact-conflicting hallucinations in Large Language Models (LLMs) by deconstructing responses into atomic units, each representing a single piece of information. The accuracy of these responses is averaged to produce the DAHL Score, offering a more in-depth evaluation of hallucinations than previous methods that rely on multiple-choice tasks. We conduct experiments with 8 different models, finding that larger models tend to hallucinate less; however, beyond a model size of 7 to 8 billion parameters, further scaling does not significantly improve factual accuracy. The DAHL Score holds potential as an efficient alternative to human-annotated preference labels and can be extended to other specialized domains. We publicly release the dataset and code.
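
The scoring step described in the abstract can be illustrated with a minimal sketch. It assumes that each response has already been split into atomic units and that each unit carries a boolean factuality verdict; the function names `response_accuracy` and `dahl_score`, the input layout, and the choice to skip responses with no units are all assumptions made for illustration, not the authors' released code.

```python
from statistics import mean
from typing import Sequence


def response_accuracy(unit_verdicts: Sequence[bool]) -> float:
    """Fraction of atomic units in one response that were judged factual."""
    if not unit_verdicts:
        raise ValueError("response has no atomic units")
    return sum(unit_verdicts) / len(unit_verdicts)


def dahl_score(responses: Sequence[Sequence[bool]]) -> float:
    """Average the per-response accuracies into a single score.

    Assumption: responses with no atomic units are skipped rather than
    counted as zero; the source does not say how such cases are handled.
    """
    accuracies = [response_accuracy(v) for v in responses if v]
    if not accuracies:
        raise ValueError("no scorable responses")
    return mean(accuracies)


if __name__ == "__main__":
    # Hypothetical verdicts: the first response has 3 of 4 factual units,
    # the second has 1 of 2.
    verdicts = [[True, True, False, True], [True, False]]
    print(dahl_score(verdicts))
```

Averaging per response, rather than pooling all units, keeps a single long response from dominating the score; whether DAHL makes the same choice is not stated in the abstract.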

Paper Structure

This paper contains 24 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: The DAHL benchmark dataset construction process.
  • Figure 2: The categorical distribution of the DAHL benchmark dataset.
  • Figure 3: The automated hallucination evaluation pipeline.
  • Figure 4: An example of a generated response and two versions of its decomposition into units: one from the Splitter model and the other from human annotation. The Splitter enables a thorough evaluation by splitting a response into units that each contain one piece of information about an entity or a relationship between pieces of information.
  • Figure 5: The DAHL Score for Llama-3.1-8b, Gemma-2-9b, Qwen-2-8b, and Mistral-Nemo-Base-2407 (12 billion parameters) evaluated across temperatures ranging from 0.1 to 1.0. The optimal temperature for each model falls within the range of 0.1 to 0.3, with a slight linear decrease in DAHL Score as the temperature increases.
  • ...and 5 more figures