MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models

Qiao Yan, Yuchen Yuan, Xiaowei Hu, Yihan Wang, Jiaqi Xu, Jinpeng Li, Chi-Wing Fu, Pheng-Ann Heng

TL;DR

MedHallTune addresses hallucinations in medical vision-language models (VLMs) by introducing a large-scale, specialized benchmark of over 100k images and 1M instruction pairs that include both hallucination and non-hallucination samples. The dataset is generated with GPT-4o and validated through a two-step self-check to ensure alignment with medical ground-truth annotations, and it uses four clinically focused metrics to assess performance. Experiments show that fine-tuning with MedHallTune improves hallucination management and zero-shot medical VQA performance across diverse VLMs, including both general and medical-domain systems. The work supports trustworthy medical AI by providing a rigorous evaluation framework and practical improvements for mitigating clinically harmful hallucinations.

Abstract

The increasing use of vision-language models (VLMs) in healthcare applications presents significant challenges related to hallucinations, in which a model generates seemingly plausible results that are in fact incorrect. Such hallucinations can jeopardize clinical decision making, potentially harming diagnosis and treatment. In this work, we propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. Comprising over 100,000 images and 1,000,000 instruction pairs, MedHallTune includes both hallucination and non-hallucination samples, each with ground-truth annotations. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across four key metrics: clinical accuracy, relevance, detail level, and risk level. The experimental results show that fine-tuning with MedHallTune improves the ability of several existing models to manage hallucinations and boosts their zero-shot performance on downstream visual-question-answering (VQA) tasks, making them more reliable for practical medical applications. Our work contributes to the development of more trustworthy VLMs. Code and dataset will be available at https://github.com/russellyq/MedHallTune.

Paper Structure

This paper contains 8 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Examples of medical hallucination in VLMs. The user asks about non-existent objects in the orbit of the brain or poses a question based on incorrect medical knowledge of trauma to the abdomen. LLaVA-Med [li2024llava] generates a plausible but incorrect response, as shown in red. In contrast, after being fine-tuned on MedHallTune, it provides the correct answer, effectively countering the hallucination, as highlighted in green.
  • Figure 2: Overview of the pipeline, demonstrating the process of mitigation and evaluation of medical hallucinations in VLMs via instruction tuning on MedHallTune.
  • Figure 3: Examples of (a) hallucination and (b) non-hallucination instruction data. In the hallucination instructions, user questions are designed to ask about non-existent medical objects, incorrect attributes of medical objects, and erroneous clinical knowledge. The answers are formulated to address these hallucinations by providing accurate responses. (c) Quality control by filtering out incorrect instructions.
  • Figure 4: Evaluation procedures and detailed scoring criteria.
  • Figure 5: Ablation study comparing model performance across training sets: positive (non-hallucination) only, negative (hallucination) only, and MedHallTune with and without (w/o) quality control, as well as training on 25%, 50%, 75%, and 100% of the data.