Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs

Xiao Liang, Chenxi Liu, Zhi Ma, Di Wang, Bin Jing, Quan Wang, Yuanyuan Shi

TL;DR

This paper tackles hallucinations in Medical Vision-Language Models by introducing Anatomical Region-Guided Contrastive Decoding (ARCD), a training-free, plug-and-play decoding strategy. ARCD uses Dynamic Attention Mask Generation to convert anatomical region masks into token-level guidance, and a three-tiered Mask-Guided Conditional Token Weighting to steer generation at the token, attention, and logits levels. Experiments across chest X-ray, CT, brain MRI, and ocular ultrasound demonstrate improved regional grounding and reduced hallucinations, with thorough ablations and case studies supporting robustness. The work offers a practical, scalable approach to enhance clinical reliability of MedVLMs without additional training data or model updates.

Abstract

Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model's focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method's effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.

Paper Structure

This paper contains 34 sections, 5 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: An example of hallucination driven by a statistical bias. The model misidentifies visually apparent ECG leads as a PICC line because the latter is far more common in training corpora reports. This flawed prior-visual association leads to a factually incorrect response and demonstrates a critical failure in visual grounding.
  • Figure 2: Overview of our proposed Anatomical Region-Guided Contrastive Decoding strategy. Left: The Dynamic Attention Mask Generation module converts a specified anatomical region (e.g., a segmentation annotation) into a multi-scale token-level mask. Right: The Mask-Guided Conditional Token Weighting module then uses this mask to steer the decoding process via a strategy that applies contrastive re-weighting at the token level, attention level, and logits level, ensuring the generated answer is grounded in the specified visual region.
  • Figure 3: GPT-4o evaluation of model responses under different settings on 250 samples uniformly drawn from three datasets. Phi-3.5V-ZS and Phi-3.5V-Med-ZS represent the zero-shot results for the base model and the model adapted with PubMedVision. Phi-3.5V-Med-FT is the model fine-tuned on the three MedVQA datasets, while w/ ARCD denotes our proposed method with attentional masking.
  • Figure 4: Parameter ablation of $\beta$ and $\gamma$ using the Phi-3.5V-Med zero-shot model, with $\alpha = 0.01$ fixed.
  • Figure 5: Ablation Study on Visual Prompting: Impact of Visual ROI and Attention on MedVLM Performance.
  • ...and 4 more figures
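
The two stages described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `patch` and `threshold` parameters, and the single `strength` coefficient are hypothetical, the mask is reduced at one scale only (the paper describes a multi-scale mask), and the logits step uses the generic contrastive decoding form rather than the paper's own equations, which are not reproduced in this summary.

```python
from typing import List


def region_mask_to_token_mask(
    mask: List[List[int]], patch: int, threshold: float = 0.5
) -> List[int]:
    """Reduce a binary pixel mask to one 0/1 flag per image patch (row-major).

    Assumption: a patch counts as "inside the region" when at least
    `threshold` of its pixels are set. This rule is illustrative only.
    """
    height, width = len(mask), len(mask[0])
    if height % patch or width % patch:
        raise ValueError("mask size must be divisible by the patch size")
    flags = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            covered = sum(
                mask[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            )
            flags.append(1 if covered / (patch * patch) >= threshold else 0)
    return flags


def contrastive_logits(
    guided: List[float], unguided: List[float], strength: float
) -> List[float]:
    """Generic contrastive decoding step: (1 + s) * guided - s * unguided.

    `guided` stands for next-token logits computed with region guidance,
    `unguided` for logits computed without it (both hypothetical inputs).
    """
    if len(guided) != len(unguided):
        raise ValueError("logit vectors must have the same length")
    return [(1 + strength) * g - strength * u for g, u in zip(guided, unguided)]


# A 4x4 mask whose top-left 2x2 block is the region of interest.
pixel_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
token_mask = region_mask_to_token_mask(pixel_mask, patch=2)
adjusted = contrastive_logits([2.0, 1.0], [1.0, 1.0], strength=1.0)
```

Here `token_mask` flags only the first of the four patches, and `adjusted` raises the logit of the token that the guided pass favours while leaving unchanged the token on which both passes agree.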