CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding
Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho
TL;DR
This paper addresses medical hallucinations in radiology multimodal LLMs (MLLMs), which undermine clinical reliability in tasks such as radiology report generation (RRG) and visual question answering (VQA). It introduces Clinical Contrastive Decoding (CCD), a training-free, inference-time framework that uses task-specific expert models to supply structured clinical signals in two stages: Symptom-grounded Contrastive Decoding (SCD) and Expert-informed Contrastive Decoding (ECD). CCD blends anchor-derived symptom labels and expert-probability guidance into the decoding process through weighting hyperparameters ($\alpha=0.5$, $\beta=0.5$, $\gamma=10$), improving clinical fidelity and lexical quality across multiple backbones and public datasets, with up to a 17% RadGraph-F1 improvement on MIMIC-CXR. The approach is lightweight, retrieval-free, and robust to adversarial expert signals, suggesting a practical way to make radiology AI safer by integrating domain expertise into generation without retraining.
Abstract
Multimodal large language models (MLLMs) have recently achieved remarkable progress in radiology by integrating visual perception with natural language understanding. However, they often generate clinically unsupported descriptions, known as medical hallucinations, which pose serious risks in medical applications that demand accuracy and image-grounded outputs. Through empirical analysis, we find that prompt-induced hallucinations remain prevalent in radiology MLLMs, largely due to over-sensitivity to clinical sections. To address this, we introduce Clinical Contrastive Decoding (CCD), a training-free and retrieval-free inference framework that integrates structured clinical signals from task-specific radiology expert models. CCD introduces a dual-stage contrastive mechanism to refine token-level logits during generation, thereby enhancing clinical fidelity without modifying the base MLLM. Experiments on three datasets and multiple models demonstrate that CCD consistently improves overall performance on radiology report generation (RRG). On the MIMIC-CXR dataset, it yields up to a 17% improvement in RadGraph-F1 when applied to state-of-the-art RRG models. Our approach provides a lightweight and generalisable solution for mitigating medical hallucinations, effectively bridging expert models and MLLMs in radiology.
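To make the idea of refining token-level logits in two stages more concrete, here is a minimal toy sketch in Python. It is not the paper's formulation: the exact CCD equations are not given in this section, so the interpolation rule, the log-odds bias, and every name used (`blend`, `expert_bias`, `greedy`, `max_bias`, the four-token vocabulary, and all numeric logits) are assumptions chosen purely for illustration. The weights 0.5, 0.5, and 10 are borrowed from the TL;DR only as placeholder values and should not be read as showing how $\alpha$, $\beta$, and $\gamma$ are actually used.

```python
import math

def blend(base_logits, guided_logits, weight):
    """Stage 1 (assumed form): interpolate between the base logits and logits
    from a pass conditioned on expert-derived symptom labels.
    weight=0 leaves the base model unchanged."""
    return [(1 - weight) * b + weight * g
            for b, g in zip(base_logits, guided_logits)]

def expert_bias(logits, vocab, expert_probs, weight, max_bias):
    """Stage 2 (assumed form): shift the logits of finding tokens by the
    expert model's log-odds, clipped to [-max_bias, max_bias]."""
    out = list(logits)
    for i, tok in enumerate(vocab):
        if tok in expert_probs:
            # Clamp the probability so the log-odds stay finite.
            p = min(max(expert_probs[tok], 1e-6), 1 - 1e-6)
            log_odds = math.log(p / (1 - p))
            out[i] += weight * max(-max_bias, min(max_bias, log_odds))
    return out

def greedy(logits, vocab):
    """Pick the token with the highest logit."""
    return vocab[max(range(len(logits)), key=logits.__getitem__)]

# Hypothetical toy data: the base model prefers an unsupported finding.
vocab = ["no", "effusion", "pneumothorax", "normal"]
base = [1.0, 0.5, 2.0, 0.8]        # logits from the unmodified MLLM
guided = [1.0, 2.5, 0.5, 0.8]      # logits with symptom labels in the prompt
expert_probs = {"effusion": 0.9, "pneumothorax": 0.1}  # expert classifier

stage1 = blend(base, guided, weight=0.5)
stage2 = expert_bias(stage1, vocab, expert_probs, weight=0.5, max_bias=10.0)
print(greedy(base, vocab), "->", greedy(stage2, vocab))
```

In this toy setting the base model's top token is the unsupported finding, and the two adjustments move the top token to the finding the expert signal supports, all without touching the base model's weights.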
