MedHEval: Benchmarking Hallucinations and Mitigation Strategies in Medical Large Vision-Language Models

Aofei Chang, Le Huang, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, Cao Xiao

TL;DR

MedHEval is a benchmark for assessing hallucinations in medical large vision-language models (Med-LVLMs) that classifies errors by underlying cause: visual misinterpretation, knowledge deficiency, and context misalignment. It builds diverse close- and open-ended VQA datasets (nearly 16k pairs) and evaluates 11 (Med)-LVLMs and seven mitigation methods across these three causes. Results show that models are susceptible to hallucinations from all three causes, and that mitigation methods offer limited, sometimes inconsistent improvements, particularly for knowledge- and context-based errors. The work delivers a standardized evaluation framework, datasets, and guidance for developing more reliable Med-LVLMs and domain-specific mitigation strategies. A public repository accompanies the benchmark to support further benchmarking and method development in clinical multimodal AI.

Abstract

Large Vision Language Models (LVLMs) are becoming increasingly important in the medical domain, yet Medical LVLMs (Med-LVLMs) frequently generate hallucinations due to limited expertise and the complexity of medical applications. Existing benchmarks fail to effectively evaluate hallucinations based on their underlying causes and lack assessments of mitigation strategies. To address this gap, we introduce MedHEval, a novel benchmark that systematically evaluates hallucinations and mitigation strategies in Med-LVLMs by categorizing them into three underlying causes: visual misinterpretation, knowledge deficiency, and context misalignment. We construct a diverse set of close- and open-ended medical VQA datasets with comprehensive evaluation metrics to assess these hallucination types. We conduct extensive experiments across 11 popular (Med)-LVLMs and evaluate 7 state-of-the-art hallucination mitigation techniques. Results reveal that Med-LVLMs struggle with hallucinations arising from different causes while existing mitigation methods show limited effectiveness, especially for knowledge- and context-based errors. These findings underscore the need for improved alignment training and specialized mitigation strategies to enhance Med-LVLMs' reliability. MedHEval establishes a standardized framework for evaluating and mitigating medical hallucinations, guiding the development of more trustworthy Med-LVLMs.

Paper Structure

This paper contains 48 sections, 21 figures, and 11 tables.

Figures (21)

  • Figure 1: Examples of medical hallucinations. (a) The model hallucinates a non-existent organ "spleen" and symptom "cardiomegaly", and the measurement of the cardiac silhouette as "16 cm" exaggerates the severity of the non-existent cardiomegaly. (b) The model generates incorrect knowledge, suggesting "asthma" and "pleural effusion" as potential causes, whereas the correct answer is "pulmonary edema" or "lung cancer." (c) The model incorrectly answers a contextual medical question; the true answer should be "Yes".
  • Figure 2: Close-ended evaluation of knowledge deficiency hallucination in (Med)-LVLMs and the effectiveness of hallucination mitigation methods. For clarity, LLaVA-NeXT is denoted as LLaVA.
  • Figure 3: Close-ended evaluation of context misalignment hallucination in (Med)-LVLMs and the effectiveness of hallucination mitigation methods. For clarity, LLaVA-NeXT is denoted as LLaVA.
  • Figure 4: Prompt used to construct the close-ended dataset MM-VisHal for evaluating visual misinterpretation hallucinations with SLAKE.
  • Figure 5: Prompt used to construct the close-ended dataset CXR-VisHal for evaluating visual misinterpretation hallucinations with IU-Xray and MIMIC-CXR.
  • ...and 16 more figures