Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation

Sarosij Bose; Ravi K. Rajendran; Biplob Debnath; Konstantinos Karydis; Amit K. Roy-Chowdhury; Srimat Chakradhar

TL;DR

VALOR introduces a post-training multimodal alignment framework for Medical Vision-Language Models to generate visually grounded and clinically accurate radiology reports. It blends verifiable textual rewards with image-conditioned visual rewards using Group-Relative Proximal Optimization, following a supervised fine-tuning stage. Across IU-XRay and MIMIC-CXR, VALOR achieves superior generation quality and clinical accuracy, with attention maps that concentrate on disease-relevant regions, outperforming preference-based, retrieval-based, and zero-shot multimodal LLM baselines. This approach reduces visual hallucinations and enhances image-to-report alignment without external preference data or retrieval systems, promising practical impact for automated radiology workflows.

Abstract

Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image regions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
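The abstract does not spell out the reward definitions or the optimization objective, so the following is only an illustrative sketch of the group-relative step that GRPO-style training is generally built on: several candidate reports are sampled for the same image, each is scored, and every score is normalized against its own group. The function names `blend_rewards` and `group_relative_advantages`, the `visual_weight` parameter, and the convex blend of textual and visual scores are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def blend_rewards(text_rewards, visual_rewards, visual_weight=0.5):
    """Hypothetical convex blend of per-report textual and visual rewards.

    The paper's actual reward combination is not given in the abstract;
    visual_weight is an assumed knob in [0, 1].
    """
    if len(text_rewards) != len(visual_rewards):
        raise ValueError("reward lists must have the same length")
    if not 0.0 <= visual_weight <= 1.0:
        raise ValueError("visual_weight must lie in [0, 1]")
    return [
        (1.0 - visual_weight) * t + visual_weight * v
        for t, v in zip(text_rewards, visual_rewards)
    ]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    This is the commonly used group-relative advantage: reports that score
    above the group average get a positive advantage, the rest a negative one.
    Population standard deviation is used here as a simplifying choice.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four candidate reports sampled for one chest X-ray (made-up scores).
    text_scores = [0.9, 0.2, 0.6, 0.4]
    visual_scores = [0.7, 0.1, 0.8, 0.3]
    blended = blend_rewards(text_scores, visual_scores, visual_weight=0.5)
    for score, adv in zip(blended, group_relative_advantages(blended)):
        print(f"reward={score:.2f} advantage={adv:+.2f}")
```

Because each advantage is relative to the group, no separately trained value model or external preference dataset is needed to decide which sampled report to reinforce, which is consistent with the abstract's stated aim of avoiding costly preference data.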
