SERPENT-VLM : Self-Refining Radiology Report Generation Using Vision Language Models
Manav Nitin Kapadnis, Sohan Patnaik, Abhilash Nandy, Sourjyadip Ray, Pawan Goyal, Debdoot Sheet
TL;DR
SERPENT-VLM addresses hallucination in radiology report generation by integrating a self-refining objective with the standard causal language modeling loss. The model grounds generated text to the input X-ray through a pooled image representation and a contextual representation of the report, optimized via a weighted total loss $\mathcal{L}_{total} = \lambda_{report} \mathcal{L}_{report} + \lambda_{refine} \mathcal{L}_{refine}$. It employs a Swin-Transformer-V2 visual encoder, a trainable visual mapper, and LLaMA2-7B with LoRA, maintaining inference speed while improving grounding. Empirical results on IU X-ray and ROCO show state-of-the-art performance and robustness to noisy images, with ablations highlighting the effectiveness of joint loss optimization and attention-based contextual aggregation. The work opens avenues for self-supervised refinement in medical imaging, with potential extensions to other modalities and diagnostic contexts.
Abstract
Radiology Report Generation (R2Gen) demonstrates how Multi-modal Large Language Models (MLLMs) can automate the creation of accurate and coherent radiological reports. Existing methods often hallucinate details in text-based reports that do not accurately reflect the image content. To mitigate this, we introduce a novel strategy, SERPENT-VLM (SElf Refining Radiology RePort GENeraTion using Vision Language Models), which improves the R2Gen task by integrating a self-refining mechanism into the MLLM framework. We employ a unique self-supervised loss that leverages similarity between pooled image representations and the contextual representations of the generated radiological text, alongside the standard Causal Language Modeling objective, to refine image-text representations. This allows the model to scrutinize and align the generated text through dynamic interaction between a given image and the generated text, thereby reducing hallucination and continuously enhancing nuanced report generation. SERPENT-VLM outperforms existing baselines such as LLaVA-Med and BiomedGPT, achieving SoTA performance on the IU X-ray and Radiology Objects in COntext (ROCO) datasets, and also proves to be robust against noisy images. A qualitative case study highlights this progress towards more sophisticated MLLM frameworks for R2Gen, opening paths for further research into self-supervised refinement in the medical imaging domain.
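To make the joint objective concrete, the following is a minimal sketch of how a weighted sum of a causal language modeling loss and a similarity-based refinement loss could be computed. It is not the authors' implementation. The summary above does not give the exact form of $\mathcal{L}_{refine}$, so the sketch assumes it is one minus the cosine similarity between a mean-pooled image vector and an attention-pooled report vector. All function names, the pooling choices, and the default weights are hypothetical and chosen only for illustration.

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def attention_pool(vectors, query):
    """Softmax-weighted sum of vectors, scored by dot product with `query`.
    Assumption: a simple stand-in for attention-based contextual aggregation."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(query)
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def refine_loss(image_patches, report_tokens):
    """Assumed refinement loss: 1 - cos(pooled image, contextual report)."""
    pooled_image = mean_pool(image_patches)
    context = attention_pool(report_tokens, pooled_image)
    return 1.0 - cosine_similarity(pooled_image, context)

def report_loss(reference_token_probs):
    """Causal LM loss: mean negative log-likelihood of the reference tokens."""
    n = len(reference_token_probs)
    return -sum(math.log(p) for p in reference_token_probs) / n

def total_loss(l_report, l_refine, lambda_report=1.0, lambda_refine=1.0):
    """L_total = lambda_report * L_report + lambda_refine * L_refine."""
    return lambda_report * l_report + lambda_refine * l_refine

if __name__ == "__main__":
    # Toy 2-d embeddings, purely illustrative.
    patches = [[1.0, 0.0], [0.8, 0.2]]
    tokens = [[0.9, 0.1], [0.1, 0.9]]
    l_rep = report_loss([0.6, 0.7, 0.5])
    l_ref = refine_loss(patches, tokens)
    print(total_loss(l_rep, l_ref, lambda_report=1.0, lambda_refine=0.5))
```

Under these assumptions, a report representation aligned with the image drives the refinement term towards zero, while an unrelated one pushes it towards one. The refinement term only changes the training signal, which is consistent with the claim that inference speed is maintained.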
