Table of Contents
Fetching ...

S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation

Jiechao Gao, Chang Liu, Yuangang Li

TL;DR

This work tackles radiology report generation by addressing the lack of anatomically-grounded alignment in standard SFT methods. It introduces S2D-Align, a shallow-to-deep learning paradigm built on Progressive Anatomical Grounding (PAG) and a memory-based Shallow-to-Deep Memory Adapter (SMA) to progressively fuse visual, reference, and key-phrase information. Through a three-stage curriculum, the approach achieves state-of-the-art performance on MIMIC-CXR and IU X-Ray, with ablations confirming the value of coarse-to-fine grounding and shared feature memory. The results demonstrate improved factual correctness and clinical grounding, suggesting a viable path toward more trustworthy generative models in medical imaging contexts.

Abstract

Radiology Report Generation (RRG) aims to automatically generate diagnostic reports from radiology images. To achieve this, existing methods have leveraged the powerful cross-modal generation capabilities of Multimodal Large Language Models (MLLMs), primarily focusing on optimizing cross-modal alignment between radiographs and reports through Supervised Fine-Tuning (SFT). However, by only performing instance-level alignment with the image-text pairs, the standard SFT paradigm fails to establish anatomically-grounded alignment, where the templated nature of reports often leads to sub-optimal generation quality. To address this, we propose \textsc{S2D-Align}, a novel SFT paradigm that establishes anatomically-grounded alignment by leveraging auxiliary signals of varying granularities. \textsc{S2D-Align} implements a shallow-to-deep strategy, progressively enriching the alignment process: it begins with the coarse radiograph-report pairing, then introduces reference reports for instance-level guidance, and ultimately utilizes key phrases to ground the generation in specific anatomical details. To bridge the different alignment stages, we introduce a memory-based adapter that empowers feature sharing, thereby integrating coarse and fine-grained guidance. For evaluation, we conduct experiments on the public \textsc{MIMIC-CXR} and \textsc{IU X-Ray} benchmarks, where \textsc{S2D-Align} achieves state-of-the-art performance compared to existing methods. Ablation studies validate the effectiveness of our multi-stage, auxiliary-guided approach, highlighting a promising direction for enhancing grounding capabilities in complex, multi-modal generation tasks.

S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation

TL;DR

This work tackles radiology report generation by addressing the lack of anatomically-grounded alignment in standard SFT methods. It introduces S2D-Align, a shallow-to-deep learning paradigm built on Progressive Anatomical Grounding (PAG) and a memory-based Shallow-to-Deep Memory Adapter (SMA) to progressively fuse visual, reference, and key-phrase information. Through a three-stage curriculum, the approach achieves state-of-the-art performance on MIMIC-CXR and IU X-Ray, with ablations confirming the value of coarse-to-fine grounding and shared feature memory. The results demonstrate improved factual correctness and clinical grounding, suggesting a viable path toward more trustworthy generative models in medical imaging contexts.

Abstract

Radiology Report Generation (RRG) aims to automatically generate diagnostic reports from radiology images. To achieve this, existing methods have leveraged the powerful cross-modal generation capabilities of Multimodal Large Language Models (MLLMs), primarily focusing on optimizing cross-modal alignment between radiographs and reports through Supervised Fine-Tuning (SFT). However, by only performing instance-level alignment with the image-text pairs, the standard SFT paradigm fails to establish anatomically-grounded alignment, where the templated nature of reports often leads to sub-optimal generation quality. To address this, we propose \textsc{S2D-Align}, a novel SFT paradigm that establishes anatomically-grounded alignment by leveraging auxiliary signals of varying granularities. \textsc{S2D-Align} implements a shallow-to-deep strategy, progressively enriching the alignment process: it begins with the coarse radiograph-report pairing, then introduces reference reports for instance-level guidance, and ultimately utilizes key phrases to ground the generation in specific anatomical details. To bridge the different alignment stages, we introduce a memory-based adapter that empowers feature sharing, thereby integrating coarse and fine-grained guidance. For evaluation, we conduct experiments on the public \textsc{MIMIC-CXR} and \textsc{IU X-Ray} benchmarks, where \textsc{S2D-Align} achieves state-of-the-art performance compared to existing methods. Ablation studies validate the effectiveness of our multi-stage, auxiliary-guided approach, highlighting a promising direction for enhancing grounding capabilities in complex, multi-modal generation tasks.

Paper Structure

This paper contains 21 sections, 8 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of S2D-Align, with the Progressive Anatomical Grounding (PAG) and Shallow-to-Deep Memory Adapter (SMA) modules as its core components. Herein, we use the same medical text encoders to convert reference reports or key phrases into embeddings, and adopt a shared memory bank inherited from earlier stages to later ones throughout PAG.
  • Figure 2: A case study selected from MIMIC-CXR, with medical concepts shared by the ground-truth and generated outputs highlighted in the same color. The categories and optimized parameters for the LLM-based methods are detailed in parentheses.
  • Figure 3: Illustration of the refinement prompt used in the third stage of PAG. The system prompt (above the dashed line) provides several few-shot demonstrations to demonstrate the phrase generation process via in-context learning, guiding the Large Language Model (LLM) to convert structured entity-relation tuples into coherent clinical phrases. The user prompt (below the dashed line) then supplies the new set of tuples to be processed.