MedHal: An Evaluation Dataset for Medical Hallucination Detection

Gaya Mehenni; Fabrice Lamarche; Odette Rios-Ibacache; John Kildea; Amal Zouaq

TL;DR

MedHal introduces the first large-scale, domain-focused dataset for medical hallucination detection. It combines a unified task formulation with data transformed from multiple tasks (Question Answering, Information Extraction, Natural Language Inference, and Summarization) and provides explanations for non-factual statements. The authors demonstrate the utility of MedHal by training and evaluating diverse models, showing that specialized evaluator models achieve strong factuality detection while medical models offer better explanations. Fine-tuning on MedHal yields measurable improvements in F1 across models, with Summarization emerging as the most impactful task for transfer learning. Downstream evaluations on MedNLI and MIMIC-based hallucination datasets validate cross-task benefits. The authors note limitations, including reliance on synthetic data and a narrow language scope, and point to multi-annotator and multilingual expansion as future work.
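As a rough illustration of the unified formulation and the F1 metric mentioned above, the sketch below models one sample as a statement with optional context, a factuality label, and an explanation. It then scores binary predictions with F1. The field names, the toy sample, and the choice to treat the non-factual label as the positive class are all assumptions made for illustration; they are not taken from the MedHal schema or evaluation code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class HallucinationSample:
    # Hypothetical field names; the actual MedHal schema may differ.
    statement: str
    context: Optional[str]
    is_factual: bool
    explanation: Optional[str]  # expected only when is_factual is False
    source_task: str            # e.g. "QA", "IE", "NLI", "Summarization"


def f1_score(gold: Sequence[bool], pred: Sequence[bool],
             positive: bool = False) -> float:
    """F1 for a single class.

    By assumption, the non-factual label (False) is the positive class,
    so the score reflects how well hallucinations are detected.
    """
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented toy sample, not drawn from the dataset.
sample = HallucinationSample(
    statement="The patient was prescribed 500 mg of amoxicillin.",
    context="The patient was prescribed 250 mg of amoxicillin.",
    is_factual=False,
    explanation="The context states a dose of 250 mg, not 500 mg.",
    source_task="Summarization",
)

gold = [False, False, True, True]
pred = [False, True, True, True]
score = f1_score(gold, pred)  # one of two hallucinations is caught
```

Reporting F1 on the non-factual class rather than plain accuracy keeps the score meaningful when factual and non-factual samples are imbalanced, which is one plausible reason to prefer it here.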

Abstract

We present MedHal, a novel large-scale dataset specifically designed to evaluate whether models can detect hallucinations in medical texts. Current hallucination detection methods face significant limitations when applied to specialized domains like medicine, where hallucinations can have disastrous consequences. Existing medical datasets are either too small, containing only a few hundred samples, or focus on a single task like Question Answering or Natural Language Inference. MedHal addresses these gaps by: (1) incorporating diverse medical text sources and tasks; (2) providing a substantial volume of annotated samples suitable for training medical hallucination detection models; and (3) including explanations for factual inconsistencies to guide model learning. We demonstrate MedHal's utility by training and evaluating a baseline medical hallucination detection model, showing improvements over general-purpose hallucination detection approaches. This resource enables more efficient evaluation of medical text generation systems while reducing reliance on costly expert review, potentially accelerating medical AI research.
