Med-VCD: Mitigating Hallucination for Medical Large Vision Language Models through Visual Contrastive Decoding

Zahra Mahdavi; Zahra Khodakaramimaghsoud; Hooman Khaloo; Sina Bakhshandeh Taleshani; Erfan Hashemi; Javad Mirzapour Kaleybar; Omid Nejati Manzari

TL;DR

Med-VCD tackles hallucinations in medical vision-language models by introducing a decoding-time framework that jointly enforces visual grounding and efficiency. It combines Visual-Aware Token Selection (VATS), Sparse-based Visual Contrastive Decoding (SVCD), and Sinking Attention Calibration (SAC) to prune tokens, contrast logits, and stabilize attention without requiring retraining or multi-round decoding. Across eight medical datasets spanning radiology, ophthalmology, and pathology, Med-VCD yields substantial gains in factual accuracy and reduced hallucination rates while maintaining decoding speed, outperforming decoding-based and retrieval-augmented baselines. The approach demonstrates strong cross-domain generalization and plug-and-play applicability across architectures, underscoring its potential to improve reliability in clinical AI tools.

Abstract

Large vision-language models (LVLMs) are now central to healthcare applications such as medical visual question answering and imaging report generation. Yet these models remain vulnerable to hallucination: outputs that appear plausible but are in fact incorrect. In the natural image domain, several decoding strategies have been proposed to mitigate hallucinations by reinforcing visual evidence, but most rely on secondary decoding or rollback procedures that substantially slow inference. Moreover, existing solutions are often domain-specific and may introduce misalignment between modalities or between generated and ground-truth content. We introduce Med-VCD, a sparse visual-contrastive decoding method that mitigates hallucinations in medical LVLMs without the time overhead of secondary decoding. Med-VCD incorporates a novel token-sparsification strategy that selects visually informed tokens on the fly, trimming redundancy while retaining critical visual context and thus balancing efficiency with reliability. Evaluations on eight medical datasets, spanning ophthalmology, radiology, and pathology tasks in visual question answering, report generation, and dedicated hallucination benchmarks, show that Med-VCD raises factual accuracy by an average of 13% and improves hallucination accuracy by 6% relative to baseline medical LVLMs.
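To make the two ideas named in the abstract more concrete, here is a minimal sketch of token sparsification followed by contrastive logit adjustment. It is not the authors' implementation: the abstract gives no formulas, so the top-k selection rule, the contrast form (1 + alpha) * grounded - alpha * weak (the generic form used in contrastive decoding for natural images), the function names, and every number below are assumptions chosen for illustration only.

```python
# Illustrative sketch only; all names, scores, and logits are hypothetical.

def select_visual_tokens(scores, keep):
    """Return indices of the `keep` highest-scoring visual tokens, in original order.

    `scores` stands in for some per-token visual relevance measure
    (an assumption; the abstract does not define it).
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def contrastive_logits(grounded, weak, alpha=1.0):
    """Generic contrastive adjustment: (1 + alpha) * grounded - alpha * weak.

    `grounded` are next-token logits computed with strong visual evidence;
    `weak` are logits computed with degraded visual evidence, which lean
    more heavily on language priors.
    """
    return [(1 + alpha) * g - alpha * w for g, w in zip(grounded, weak)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy vocabulary and made-up logits for one decoding step.
vocab = ["normal", "effusion", "fracture"]
grounded = [2.0, 1.8, 0.1]   # visually grounded pass slightly prefers "normal"
weak = [2.0, 0.5, 0.1]       # weakly grounded pass strongly prefers "normal"

kept = select_visual_tokens([0.1, 0.7, 0.05, 0.9], keep=2)
adjusted = contrastive_logits(grounded, weak, alpha=1.0)

print(kept)                      # indices of the retained visual tokens
print(vocab[argmax(grounded)])   # choice before the contrast
print(vocab[argmax(adjusted)])   # choice after the contrast
```

In this toy case the token favored mainly by the language prior loses its lead once the weakly grounded logits are subtracted, which is the intuition behind contrasting logits at decoding time. How Med-VCD actually scores tokens and builds the contrast branch is specified in the paper, not here.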
