Vision-Amplified Semantic Entropy for Hallucination Detection in Medical Visual Question Answering

Zehui Liao, Shishuai Hu, Ke Zou, Huazhu Fu, Liangli Zhen, Yong Xia

TL;DR

Medical Visual Question Answering (VQA) systems can produce hallucinations, that is, answers that contradict the input image, which threatens clinical reliability. The paper presents Vision-Amplified Semantic Entropy (VASE), a three-stage, perturbation-based framework that combines weak visual transformations with a contrastive distribution to ground semantic predictions in visual evidence. VASE estimates a semantic predictive distribution under weak transformations, contrasts it with the distribution obtained from a distorted image, and uses the entropy of the resulting contrastive distribution as a hallucination score, with a learned threshold. Empirical results on two medical VQA datasets show that VASE outperforms seven baselines on both the open-ended and the full test sets, and ablations highlight the importance of weak transformations and contrastive amplification. The approach supports more trustworthy deployment of multimodal large language models (MLLMs) for medical imaging by improving detection of care-critical hallucinations.

Abstract

Multimodal large language models (MLLMs) have demonstrated significant potential in medical Visual Question Answering (VQA). Yet they remain prone to hallucinations, i.e., incorrect responses that contradict input images, which pose substantial risks in clinical decision-making. Detecting these hallucinations is essential for establishing trust in MLLMs among clinicians and patients, thereby enabling their real-world adoption. Current hallucination detection methods, especially semantic entropy (SE), have shown promising detection capacity for LLMs. However, adapting SE to medical MLLMs by incorporating visual perturbations presents a dilemma. Weak perturbations preserve image content and ensure clinical validity, but may be overlooked by medical MLLMs, which tend to over-rely on language priors. In contrast, strong perturbations can distort essential diagnostic features, compromising clinical interpretation. To address this issue, we propose Vision-Amplified Semantic Entropy (VASE), which incorporates weak image transformations and amplifies the impact of visual input to improve hallucination detection in medical VQA. We first estimate the semantic predictive distribution under weak visual transformations to preserve clinical validity, and then amplify visual influence by contrasting this distribution with that derived from a distorted image. The entropy of the resulting distribution is estimated as VASE. Experiments on two medical open-ended VQA datasets demonstrate that VASE consistently outperforms existing hallucination detection methods.
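The two steps described in the abstract (a semantic distribution under weak transformations, contrasted with one from a distorted image, followed by an entropy) can be illustrated with a small sketch. This is not the authors' implementation: the summary does not give the contrast formula, so the code below assumes a contrastive-decoding-style combination, `(1 + alpha) * log p_weak - alpha * log p_distorted`, renormalised with a softmax. The names `contrastive_distribution`, `vase_score`, `alpha`, and `eps` are hypothetical, and both inputs are assumed to be probabilities over the same, already aligned semantic equivalence classes.

```python
import math


def semantic_entropy(probs):
    """Shannon entropy (in nats) of a distribution over semantic clusters."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def contrastive_distribution(p_weak, p_distorted, alpha=1.0, eps=1e-12):
    """Combine two cluster distributions so that visual evidence is amplified.

    ASSUMPTION: the log-linear contrast below is a stand-in; the exact
    formula used by the paper is not stated in this summary.
    p_weak      -- cluster probabilities under weak image transformations
    p_distorted -- cluster probabilities for a distorted image
    """
    scores = [
        (1.0 + alpha) * math.log(pw + eps) - alpha * math.log(pd + eps)
        for pw, pd in zip(p_weak, p_distorted)
    ]
    # Numerically stable softmax over the contrast scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def vase_score(p_weak, p_distorted, alpha=1.0):
    """Entropy of the contrastive distribution; higher suggests hallucination."""
    return semantic_entropy(contrastive_distribution(p_weak, p_distorted, alpha))
```

Under this assumed formula, a cluster that is favoured with the clean image but loses support once the image is distorted gains probability mass, so the entropy drops when the answer depends on visual evidence. When the two distributions are identical, the contrast leaves the distribution unchanged and the score reduces to plain semantic entropy.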

Paper Structure

This paper contains 9 sections, 4 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of VASE estimation for hallucination detection in medical VQA. VASE is the entropy of the estimated C-SPD. 'A' denotes the alignment of semantic equivalence classes. 'Prob.' denotes probability.
  • Figure 2: The AUG curves of SE and VASE on (a) the open-ended test samples and (b) all test samples of MIMIC-Diff-VQA. Each point on a curve is the mean GREEN score of the model on the most-confident X% of samples, as ranked by SE or VASE. The red dashed line marks the mean GREEN score over all samples in settings (a) and (b).
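The construction of each point in Figure 2 (mean quality score over the most-confident X% of samples) can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code: `mean_score_at_coverage` and its parameters are hypothetical names, `quality` stands in for per-sample GREEN scores, and lower `uncertainty` (SE or VASE) is assumed to mean higher confidence.

```python
def mean_score_at_coverage(uncertainty, quality, fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Return (fraction, mean quality) pairs for the most-confident samples.

    uncertainty -- per-sample uncertainty score (e.g. SE or VASE); lower = more confident
    quality     -- per-sample answer quality (a stand-in for the GREEN score)
    fractions   -- coverage levels X, as fractions of the dataset
    """
    n = len(uncertainty)
    # Sort sample indices from most confident to least confident.
    order = sorted(range(n), key=lambda i: uncertainty[i])
    curve = []
    for fraction in fractions:
        kept = max(1, round(fraction * n))  # keep at least one sample
        mean_quality = sum(quality[i] for i in order[:kept]) / kept
        curve.append((fraction, mean_quality))
    return curve
```

For example, with uncertainties `[0.1, 0.9, 0.2, 0.8]` and quality scores `[1.0, 0.0, 0.8, 0.2]`, the curve at 50% and 100% coverage is roughly 0.9 and 0.5. A useful uncertainty score gives a curve that stays above the all-sample mean at low coverage.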