Med-VCD: Mitigating Hallucination for Medical Large Vision Language Models through Visual Contrastive Decoding

Zahra Mahdavi, Zahra Khodakaramimaghsoud, Hooman Khaloo, Sina Bakhshandeh Taleshani, Erfan Hashemi, Javad Mirzapour Kaleybar, Omid Nejati Manzari

TL;DR

Med-VCD tackles hallucinations in medical vision-language models by introducing a decoding-time framework that jointly enforces visual grounding and efficiency. It combines Visual-Aware Token Selection (VATS), Sparse-based Visual Contrastive Decoding (SVCD), and Sinking Attention Calibration (SAC) to prune tokens, contrast logits, and stabilize attention without requiring retraining or multi-round decoding. Across eight medical datasets spanning radiology, ophthalmology, and pathology, Med-VCD yields substantial gains in factual accuracy and reduced hallucination rates while maintaining decoding speed, outperforming decoding-based and retrieval-augmented baselines. The approach demonstrates strong cross-domain generalization and plug-and-play applicability across architectures, underscoring its potential to improve reliability in clinical AI tools.

Abstract

Large vision-language models (LVLMs) are now central to healthcare applications such as medical visual question answering and imaging report generation. Yet these models remain vulnerable to hallucination: outputs that appear plausible but are in fact incorrect. In the natural image domain, several decoding strategies have been proposed to mitigate hallucinations by reinforcing visual evidence, but most rely on secondary decoding or rollback procedures that substantially slow inference. Moreover, existing solutions are often domain-specific and may introduce misalignment between modalities or between generated and ground-truth content. We introduce Med-VCD, a sparse visual-contrastive decoding method that mitigates hallucinations in medical LVLMs without the time overhead of secondary decoding. Med-VCD incorporates a novel token-sparsification strategy that selects visually informed tokens on the fly, trimming redundancy while retaining critical visual context and thus balancing efficiency with reliability. Evaluations on eight medical datasets, spanning ophthalmology, radiology, and pathology tasks in visual question answering, report generation, and dedicated hallucination benchmarks, show that Med-VCD raises factual accuracy by an average of 13% and improves hallucination accuracy by 6% relative to baseline medical LVLMs.
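
As a rough illustration of what "contrasting logits" can mean at decoding time, the sketch below applies a generic contrastive-decoding rule of the kind used in the natural-image literature. It is not taken from the paper: the function name contrastive_next_token, the weight alpha, the plausibility threshold beta, and the choice of which forward pass supplies each logit vector are all assumptions made for illustration. Med-VCD's actual formulation may differ.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrastive_next_token(logits_full, logits_contrast, alpha=1.0, beta=0.1):
    """Greedy next-token choice from two logit vectors over the same vocabulary.

    logits_full:     logits from the pass with full visual context (assumed).
    logits_contrast: logits from a contrast pass, e.g. with reduced or
                     degraded visual input (assumed).
    alpha, beta:     hypothetical hyperparameters, not values from the paper.
    """
    logp_full = log_softmax(logits_full)
    logp_contrast = log_softmax(logits_contrast)
    # Plausibility filter: only consider tokens whose full-context probability
    # is at least beta times the most likely token's probability.
    cutoff = max(logp_full) + math.log(beta)
    best_token, best_score = None, -math.inf
    for token, (lf, lc) in enumerate(zip(logp_full, logp_contrast)):
        if lf < cutoff:
            continue
        # Reward tokens the full context supports more than the contrast pass.
        score = (1 + alpha) * lf - alpha * lc
        if score > best_score:
            best_token, best_score = token, score
    return best_token

# Toy three-token vocabulary.
full = [2.0, 1.9, -3.0]
contrast = [2.0, 0.5, 4.0]
plain_choice = contrastive_next_token(full, contrast, alpha=0.0)
contrast_choice = contrastive_next_token(full, contrast, alpha=1.0)
```

With alpha set to 0 the rule reduces to ordinary greedy decoding on the full-context logits. With alpha above 0, tokens that the contrast pass also favors are penalized, and the plausibility filter keeps tokens that are very unlikely under the full context from being selected only because the contrast pass dislikes them.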

Paper Structure

This paper contains 22 sections, 7 equations, 7 figures, 15 tables.

Figures (7)

  • Figure 1: Illustrative cases of medical hallucination include the following: (a) The model incorrectly answers a context-dependent medical question; the correct response should be “No.” (b) The model fabricates clinical knowledge, proposing “pleural effusion” and “asthma,” whereas the appropriate diagnoses are “lung cancer” or “pulmonary edema.” (c) The model hallucinates the nonexistent symptom “pleural effusions” and overlooks diffuse indistinctness of the pulmonary vasculature—a radiographic finding characteristic of “pulmonary edema”.
  • Figure 2: An overview of the proposed Med-VCD approach, consisting of (1) a sparse-based VCD method; (2) visual-aware token selection; and (3) Sinking Attention Calibration.
  • Figure 3: Analysis of token sorting by attention score using LLaVA-Med-1.5.
  • Figure 4: Analysis of the attention density of textual and visual tokens using LLaVA-Med-1.5.
  • Figure 5: Close-ended evaluation of knowledge deficiency hallucination in (Med)-LVLMs and the effectiveness of hallucination mitigation methods.
  • ...and 2 more figures