CCD: Mitigating Hallucinations in Radiology MLLMs via Clinical Contrastive Decoding

Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

TL;DR

This paper tackles the problem of medical hallucinations in radiology multimodal LLMs, which threaten clinical reliability in tasks like radiology report generation (RRG) and visual question answering (VQA). It introduces Clinical Contrastive Decoding (CCD), a training-free, inference-time framework that leverages task-specific expert models to provide structured clinical signals via two stages: Symptom-grounded Contrastive Decoding (SCD) and Expert-informed Contrastive Decoding (ECD). By blending anchor-derived symptom labels and expert-probability guidance into the decoding process with controlled weights ($\alpha=0.5$, $\beta=0.5$, $\gamma=10$), CCD improves clinical fidelity and lexical quality across multiple backbones and public datasets, achieving up to a 17% RadGraph-F1 improvement on MIMIC-CXR. The approach is lightweight, retrieval-free, and robust to adversarial expert signals, suggesting a practical path to safer radiology AI by integrating domain expertise into generation without retraining.

Abstract

Multimodal large language models (MLLMs) have recently achieved remarkable progress in radiology by integrating visual perception with natural language understanding. However, they often generate clinically unsupported descriptions, known as medical hallucinations, which pose serious risks in medical applications that demand accuracy and image-grounded outputs. Through empirical analysis, we find that prompt-induced hallucinations remain prevalent in radiology MLLMs, largely due to over-sensitivity to clinical sections. To address this, we introduce Clinical Contrastive Decoding (CCD), a training-free and retrieval-free inference framework that integrates structured clinical signals from task-specific radiology expert models. CCD introduces a dual-stage contrastive mechanism to refine token-level logits during generation, thereby enhancing clinical fidelity without modifying the base MLLM. Experiments on three datasets and multiple models demonstrate that CCD consistently improves overall performance on radiology report generation (RRG). On the MIMIC-CXR dataset, it yields up to a 17% improvement in RadGraph-F1 when applied to state-of-the-art RRG models. Our approach provides a lightweight and generalisable solution for mitigating medical hallucinations, effectively bridging expert models and MLLMs in radiology.
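The dual-stage refinement of token-level logits described above can be illustrated with a toy sketch. This is not the paper's formulation: the blend `(1 + alpha) * anchored - alpha * plain` is the generic contrastive-decoding form, the log-odds shift is one plausible way to inject expert confidence, and the function names, toy vocabulary, logits, and expert probabilities are all hypothetical values chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_blend(logits_plain, logits_anchored, alpha):
    """Stage-1-style step (assumed form): amplify what the symptom-anchored
    prompt changes relative to the plain prompt."""
    return [(1 + alpha) * a - alpha * p
            for p, a in zip(logits_plain, logits_anchored)]

def expert_shift(logits, token_to_prob, beta, eps=1e-6):
    """Stage-2-style step (assumed form): shift the logits of finding-related
    tokens by beta times the log-odds of the expert model's probability."""
    out = list(logits)
    for idx, p in token_to_prob.items():
        p = min(max(p, eps), 1 - eps)  # clamp to keep the log finite
        out[idx] += beta * math.log(p / (1 - p))
    return out

# Hypothetical four-token vocabulary and next-token logits.
vocab = ["no", "effusion", "pneumothorax", "normal"]
logits_plain = [1.0, 0.5, 1.2, 0.8]      # base prompt only
logits_anchored = [1.0, 1.5, 0.6, 0.8]   # prompt plus expert symptom labels
expert_probs = {1: 0.9, 2: 0.1}          # token index -> expert probability

stage1 = contrastive_blend(logits_plain, logits_anchored, alpha=0.5)
stage2 = expert_shift(stage1, expert_probs, beta=0.5)
probs = softmax(stage2)
best = vocab[max(range(len(vocab)), key=probs.__getitem__)]
print(best)
```

In this toy setting the unsupported "pneumothorax" token, which had the highest plain logit, is pushed down in both steps, while "effusion" is pushed up. The base model's weights are never touched, which is the sense in which such a scheme is training-free.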

Paper Structure

This paper contains 66 sections, 7 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Illustration of medical hallucinations in MLLMs across two tasks: (a) MAIRA-2 (Bannur et al., 2024) for radiology report generation and (b) LLaVA-Med (Li et al., 2023) for visual question answering. Medical hallucinations are highlighted in red, referring to generated clinical content that is not supported by the image. Clinically irrelevant or counterfactual information in the reference clinical section is shown in blue. With our Clinical Contrastive Decoding (CCD), medical hallucinations in the baseline models are mitigated across both tasks and question types.
  • Figure 2: Overview of the CCD framework, which leverages a foundation expert model to enforce clinical consistency in MLLM outputs. During inference, it operates in two stages: (a) Symptom-grounded Contrastive Decoding, which incorporates structured clinical labels from the expert model; and (b) Expert-informed Contrastive Decoding, which adjusts the latent token logits using expert-derived confidence scores. The output logits are hierarchically calibrated to better match the ground-truth clinical labels. Hallucinated symptoms in the model output are marked in red.
  • Figure 3: Ablation study of guidance strength ($\alpha$, $\beta$) ranging from 0 to 1, with others fixed at default.
  • Figure 4: Illustration of additional VQA cases with CCD, using LLaVA-Med (Li et al., 2023) as the baseline. (a) is a location-specific question and (b) is a type-specific question. $\bm\alpha$, $\bm\beta$, and $\bm\lambda$ denote CCD hyperparameters during inference. Model outputs that are vague or under-specified (i.e., partially correct but lacking clinical precision) are highlighted in blue. Latent logit ratio plots illustrate token-level differences, with (a) highlighting the final term and (b) the second token. In both cases, the top-5 overlapping tokens across two hyperparameter settings are shown as examples. The chest X-ray is blurred to preserve privacy and minimise visual discomfort.