S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation

Jiechao Gao; Chang Liu; Yuangang Li

TL;DR

This work tackles radiology report generation by addressing the lack of anatomically-grounded alignment in standard SFT methods. It introduces S2D-Align, a shallow-to-deep learning paradigm built on Progressive Anatomical Grounding (PAG) and a memory-based Shallow-to-Deep Memory Adapter (SMA) to progressively fuse visual, reference, and key-phrase information. Through a three-stage curriculum, the approach achieves state-of-the-art performance on MIMIC-CXR and IU X-Ray, with ablations confirming the value of coarse-to-fine grounding and shared feature memory. The results demonstrate improved factual correctness and clinical grounding, suggesting a viable path toward more trustworthy generative models in medical imaging contexts.

Abstract

Radiology Report Generation (RRG) aims to automatically generate diagnostic reports from radiology images. Existing methods leverage the cross-modal generation capabilities of Multimodal Large Language Models (MLLMs), primarily optimizing cross-modal alignment between radiographs and reports through Supervised Fine-Tuning (SFT). However, because it performs only instance-level alignment on image-text pairs, the standard SFT paradigm fails to establish anatomically-grounded alignment, and the templated nature of reports often leads to sub-optimal generation quality. To address this, we propose S2D-Align, a novel SFT paradigm that establishes anatomically-grounded alignment by leveraging auxiliary signals of varying granularities. S2D-Align implements a shallow-to-deep strategy that progressively enriches the alignment process: it begins with coarse radiograph-report pairing, then introduces reference reports for instance-level guidance, and finally uses key phrases to ground the generation in specific anatomical details. To bridge the different alignment stages, we introduce a memory-based adapter that enables feature sharing, thereby integrating coarse- and fine-grained guidance. For evaluation, we conduct experiments on the public MIMIC-CXR and IU X-Ray benchmarks, where S2D-Align achieves state-of-the-art performance compared to existing methods. Ablation studies validate the effectiveness of our multi-stage, auxiliary-guided approach, highlighting a promising direction for enhancing grounding capabilities in complex, multi-modal generation tasks.
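To make the shallow-to-deep strategy concrete, the following is a minimal sketch of how the conditioning input might grow across the three stages. It is not the authors' implementation: the `Sample` fields, the prompt layout, the `<image:...>` placeholder, and the choice to make the stages cumulative (stage 3 keeps the reference report and adds key phrases) are all assumptions made for illustration. The memory-based adapter is not modeled here.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class Sample:
    """One training example; all field names are hypothetical."""
    image_id: str            # stands in for the radiograph (visual features)
    report: str              # ground-truth report, used as the SFT target
    reference_report: str    # auxiliary signal: a reference report
    key_phrases: List[str]   # auxiliary signal: anatomical key phrases


def build_prompt(sample: Sample, stage: int) -> str:
    """Compose the conditioning text for one curriculum stage.

    Stage 1: radiograph only (coarse radiograph-report pairing).
    Stage 2: adds the reference report (instance-level guidance).
    Stage 3: adds key phrases (fine-grained anatomical grounding).
    Cumulative stages are an assumption of this sketch.
    """
    if stage not in (1, 2, 3):
        raise ValueError(f"stage must be 1, 2, or 3, got {stage}")
    parts = [f"<image:{sample.image_id}>"]
    if stage >= 2:
        parts.append(f"Reference: {sample.reference_report}")
    if stage >= 3:
        parts.append("Key phrases: " + "; ".join(sample.key_phrases))
    parts.append("Report:")
    return "\n".join(parts)


def curriculum(
    samples: Iterable[Sample], stages: Tuple[int, ...] = (1, 2, 3)
) -> Iterator[Tuple[int, str, str]]:
    """Yield (stage, prompt, target) triples, one full pass per stage."""
    samples = list(samples)  # allow several passes over the same data
    for stage in stages:
        for s in samples:
            yield stage, build_prompt(s, stage), s.report
```

In this sketch the target report is the same at every stage; only the auxiliary context supplied to the model changes, which mirrors the coarse-to-fine ordering described in the abstract.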

Paper Structure

Figures (3)