Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation

Sarosij Bose, Ravi K. Rajendran, Biplob Debnath, Konstantinos Karydis, Amit K. Roy-Chowdhury, Srimat Chakradhar

TL;DR

VALOR introduces a post-training multimodal alignment framework for Medical Vision-Language Models to generate visually grounded and clinically accurate radiology reports. It blends verifiable textual rewards with image-conditioned visual rewards using Group-Relative Proximal Optimization, following a supervised fine-tuning stage. Across IU-XRay and MIMIC-CXR, VALOR achieves superior generation quality and clinical accuracy, with attention maps that concentrate on disease-relevant regions, outperforming preference-based, retrieval-based, and zero-shot multimodal LLM baselines. This approach reduces visual hallucinations and enhances image-to-report alignment without external preference data or retrieval systems, promising practical impact for automated radiology workflows.

Abstract

Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR: Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image regions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
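The group-relative idea behind GRPO can be illustrated with a small sketch. For one input image, several candidate reports are sampled and each is scored by a reward; every candidate's advantage is then its reward normalized by the mean and standard deviation of its own group, so no separate value network is needed. The function name, the `eps` stabilizer, and the example reward values below are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per sampled report for the same image.
    `eps` is an assumed small constant that avoids division by zero when
    every sample in the group receives the same reward.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four reports sampled for one chest X-ray.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
best_index = max(range(len(advantages)), key=lambda i: advantages[i])
```

Reports that score above their group's average receive a positive advantage and are reinforced; those below average are discouraged. When all samples score the same, every advantage is zero and the group contributes no learning signal.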

Paper Structure

This paper contains 11 sections, 11 equations, 4 figures, 1 table, 1 algorithm.

Figures (4)

  • Figure 1: Overview. We introduce VALOR, a multimodal alignment framework that addresses the visual hallucinations of existing Med-VLMs [li2023llava]. Left: the response generated by a vanilla Med-VLM, with misalignment with the input image marked in red and generic non-medical text in cyan. Existing approaches resort to curating per-task preference data or to retrieving reports, which is inherently time-consuming and does not address multimodal alignment. Right: VALOR addresses this issue with a novel visual-reasoning-based pipeline that uses multimodal rewards for optimization and pushes the model to generate more clinically informed, grounded reports aligned with the original image; these are marked in green.
  • Figure 2: Comparison with SOTA methods. VALOR performs significantly better on both generation metrics [papineni2002bleu, lin2004rouge, denkowski2011meteor] and clinical metrics [smit2020chexbert, jain2021radgraph] than preference-data and retrieval-based methods. Further, those methods do not account for visual grounding with respect to the input image, as shown in the table.
  • Figure 3: Workflow of VALOR. Left: In Stage-1, VALOR is trained with verifiable textual rewards $R_{ver}$, with periodic clinical guidance $R_{clin}$ every $k_{clin}$ policy steps, to familiarize the base Med-VLM $\mathcal{M}(\cdot)$ [li2023llava] with radiology reasoning patterns. Right: In Stage-2, VALOR is further optimized with rewards based on image-text similarity scores computed from embeddings produced by the multi-label domain expert model $\mathcal{G}$ [irvin2019chexpert], shifting the model's focus from textual knowledge to grounded, region-wise disease reasoning. We also use format rewards to maintain the structure of generated reports using <report> tags.
  • Figure 4: Generated responses with VALOR. Top Row: LLaVA-Med [li2023llava] fails to detect the subtle reasoning patterns needed for accurate diagnosis, resulting in hallucinated disease observations such as "consolidation". In contrast, VALOR with verifiable guidance shows significantly better reasoning and diagnostic capabilities. After Stage-2, VALOR's visually enhanced reasoning correctly diagnoses all observations, showing the need for awareness of disease-relevant regions and sustained attention. Bottom Row: LLaVA-Med again hallucinates diseases due to insufficient visual focus; in comparison, VALOR reduces hallucinations and generates reports faithful to the X-ray, exhibiting better anatomical and semantic awareness.
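
The reward signals named in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the captions do not say how the rewards are combined, so plain addition is assumed, and the scalar inputs `r_ver` and `r_clin`, the embedding vectors, and all function names are hypothetical placeholders. In the paper the embeddings come from the domain expert model $\mathcal{G}$; here they are ordinary lists of floats.

```python
import math
import re
from typing import Sequence

# Matches output that is wrapped entirely in <report>...</report> tags.
REPORT_PATTERN = re.compile(r"^\s*<report>(.+?)</report>\s*$", re.DOTALL)


def format_reward(text: str) -> float:
    """Return 1.0 if the whole output is wrapped in <report> tags with non-empty content."""
    match = REPORT_PATTERN.match(text)
    return 1.0 if match and match.group(1).strip() else 0.0


def stage1_reward(r_ver: float, r_clin: float, step: int, k_clin: int) -> float:
    """Verifiable reward at every step; clinical reward added every k_clin-th step.

    Summing the two terms is an assumption made for illustration.
    """
    use_clinical = k_clin > 0 and step % k_clin == 0
    return r_ver + (r_clin if use_clinical else 0.0)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def stage2_reward(image_emb: Sequence[float], report_emb: Sequence[float], text: str) -> float:
    """Image-text similarity plus the format reward (the sum is an assumed combination)."""
    return cosine_similarity(image_emb, report_emb) + format_reward(text)


# Hypothetical usage with toy values.
r1 = stage1_reward(r_ver=0.5, r_clin=0.25, step=4, k_clin=2)
r2 = stage2_reward([1.0, 0.0], [1.0, 0.0], "<report>No acute findings.</report>")
```

Keeping each reward as a separate function makes it easy to swap in a different combination rule, such as a weighted sum, without touching the individual scoring logic.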