SERPENT-VLM : Self-Refining Radiology Report Generation Using Vision Language Models

Manav Nitin Kapadnis; Sohan Patnaik; Abhilash Nandy; Sourjyadip Ray; Pawan Goyal; Debdoot Sheet

TL;DR

SERPENT-VLM addresses hallucination in radiology report generation by integrating a self-refining objective with the standard causal language modeling loss. The model grounds generated text to the input X-ray through a pooled image representation and a contextual representation of the report, optimized via a weighted total loss $\mathcal{L}_{total} = \lambda_{report} \mathcal{L}_{report} + \lambda_{refine} \mathcal{L}_{refine}$. It employs a Swin-Transformer-V2 visual encoder, a trainable visual mapper, and LLaMA2-7B with LoRA, maintaining inference speed while improving grounding. Empirical results on IU-Xray and ROCO show state-of-the-art performance and robustness to noisy images, with ablations highlighting the effectiveness of joint loss optimization and attention-based contextual aggregation. The work opens avenues for self-supervised refinement in medical imaging, with potential extensions to other modalities and diagnostic contexts.
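The weighted objective above can be illustrated with a small sketch. Only the form $\mathcal{L}_{total} = \lambda_{report} \mathcal{L}_{report} + \lambda_{refine} \mathcal{L}_{refine}$ comes from the summary. The rest is an assumption made for illustration: the refine term is written as one minus the cosine similarity between a mean-pooled image vector and a mean-pooled report vector, and all function names and default weights are hypothetical. The paper's actual similarity measure and its attention-based aggregation may differ.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector.

    Stand-in for the pooled image representation; mean pooling over the
    report tokens is an assumption (the paper mentions attention-based
    aggregation, which is not reproduced here).
    """
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def refine_loss(image_patches, report_tokens):
    """Assumed refine term: 1 - cos(pooled image, pooled report).

    The loss is 0 when the two pooled vectors point in the same direction
    and grows as they diverge.
    """
    pooled_image = mean_pool(image_patches)
    pooled_report = mean_pool(report_tokens)
    return 1.0 - cosine_similarity(pooled_image, pooled_report)


def total_loss(report_loss, refine, lambda_report=1.0, lambda_refine=1.0):
    """Weighted sum of the language modeling loss and the refine loss.

    The default weights are placeholders, not values from the paper.
    """
    return lambda_report * report_loss + lambda_refine * refine
```

In this sketch, `report_loss` would be the cross-entropy value from the causal language modeling objective, computed elsewhere. Because the refine term only enters the training objective, nothing extra is evaluated at inference time, which is consistent with the summary's note that inference speed is maintained.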

Abstract

Radiology Report Generation (R2Gen) demonstrates how Multi-modal Large Language Models (MLLMs) can automate the creation of accurate and coherent radiological reports. Existing methods often hallucinate details in text-based reports that do not accurately reflect the image content. To mitigate this, we introduce a novel strategy, SERPENT-VLM (SElf Refining Radiology RePort GENeraTion using Vision Language Models), which improves the R2Gen task by integrating a self-refining mechanism into the MLLM framework. We employ a self-supervised loss that leverages the similarity between pooled image representations and the contextual representations of the generated radiological text, alongside the standard Causal Language Modeling objective, to refine image-text representations. This allows the model to scrutinize and align the generated text through dynamic interaction between a given image and the generated text, thereby reducing hallucination and continuously enhancing nuanced report generation. SERPENT-VLM outperforms existing baselines such as LLaVA-Med and BiomedGPT, achieving SoTA performance on the IU X-ray and Radiology Objects in COntext (ROCO) datasets, and also proves to be robust against noisy images. A qualitative case study emphasizes the significant advancements towards more sophisticated MLLM frameworks for R2Gen, opening paths for further research into self-supervised refinement in the medical imaging domain.

Paper Structure (13 sections, 6 equations, 3 figures, 2 tables)

This paper contains 13 sections, 6 equations, 3 figures, 2 tables.

Introduction
Related Work
Methodology
Overview of SERPENT-VLM
SERPENT-VLM Framework
Self-refining Strategy
Experiments and Evaluation
Implementation Details
Datasets and Evaluation Metrics
Performance of SERPENT-VLM on Radiology Report Generation
Discussion on the Impact of different Design Choices for SERPENT-VLM
How robust is SERPENT-VLM to noisy images?
Summary and Conclusion

Figures (3)

Figure 1: Generated report samples on the IU-Xray dataset. We qualitatively compare reports generated by the medical pre-trained LLMs LLaVA-Med and BiomedGPT with those generated by SERPENT-VLM. Hallucinated information in the reports is highlighted in yellow.
Figure 2: Overview of the SERPENT-VLM pipeline. The X-ray image is processed using a visual encoder (step 1) and projected onto a high-dimensional space using a visual mapper (step 2). The encoded image, together with the report generation prompt, is fed into the LLM (step 3). Cross-entropy loss is employed (step 4) for the causal language modeling objective. The pooled image representation and the contextual representation of the generated report are used to compute the self-refining loss (step 5). A weighted combination of both objectives is used to train the network (step 6).
Figure 3: Comparative performance metrics for ROCO and IU-Xray datasets.
