A Disease-Aware Dual-Stage Framework for Chest X-ray Report Generation

Puzhen Wu, Hexin Dong, Yi Lin, Yihao Ding, Yifan Peng

TL;DR

The paper addresses the challenge of automatically generating clinically accurate chest X-ray reports by introducing a disease-aware dual-stage framework. Stage 1 learns Disease-Aware Semantic Tokens (DASTs) and aligns vision-language representations using cross-attention and contrastive learning, while Stage 2 fuses disease semantics with visual features through DVAF and retrieves context from similar cases via DMSR to condition a large language model. The approach yields state-of-the-art results across CheXpert Plus, IU X-ray, and MIMIC-CXR, with ablations confirming the contributions of DASTs, DVAF, and DMSR to both linguistic quality and clinical fidelity. The work advances automated radiology reporting by integrating explicit disease guidance, efficient visual encoding, and retrieval-augmented generation, with potential practical impact on clinical workflow and radiology throughput.

Abstract

Radiology report generation from chest X-rays is an important task in artificial intelligence with the potential to greatly reduce radiologists' workload and shorten patient wait times. Despite recent advances, existing approaches often lack sufficient disease awareness in their visual representations and adequate vision-language alignment to meet the specialized requirements of medical image analysis. As a result, these models often overlook critical pathological features on chest X-rays and struggle to generate clinically accurate reports. To address these limitations, we propose a novel dual-stage disease-aware framework for chest X-ray report generation. In Stage 1, our model learns Disease-Aware Semantic Tokens (DASTs) corresponding to specific pathology categories through cross-attention mechanisms and multi-label classification, while simultaneously aligning vision and language representations via contrastive learning. In Stage 2, we introduce a Disease-Visual Attention Fusion (DVAF) module to integrate disease-aware representations with visual features, along with a Dual-Modal Similarity Retrieval (DMSR) mechanism that combines visual and disease-specific similarities to retrieve relevant exemplars, providing contextual guidance during report generation. Extensive experiments on three benchmark datasets (CheXpert Plus, IU X-ray, and MIMIC-CXR) demonstrate that our disease-aware framework achieves state-of-the-art performance in chest X-ray report generation, with significant improvements in clinical accuracy and linguistic quality.
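To make the retrieval idea concrete, the following is a minimal sketch of how visual and disease-specific similarities could be combined to pick an exemplar. It is not the authors' implementation: the use of cosine similarity, the weighted sum, the weight `alpha`, the function name `dual_modal_retrieve`, and the toy memory bank are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dual_modal_retrieve(query_visual, query_disease, bank, alpha=0.5, k=1):
    """Rank stored studies by a weighted sum of visual and disease similarity.

    `bank` is a list of (visual_embedding, disease_embedding, report) tuples.
    `alpha` (hypothetical) balances the two similarity terms.
    Returns the top-k entries as (score, index, report) tuples.
    """
    scored = []
    for idx, (visual, disease, report) in enumerate(bank):
        score = (alpha * cosine(query_visual, visual)
                 + (1.0 - alpha) * cosine(query_disease, disease))
        scored.append((score, idx, report))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy memory bank with two stored studies (illustrative values only).
bank = [
    ([1.0, 0.0], [0.9, 0.1, 0.0], "report A"),
    ([0.0, 1.0], [0.0, 0.1, 0.9], "report B"),
]
top = dual_modal_retrieve([0.9, 0.1], [0.8, 0.2, 0.0], bank)
```

In this sketch the retrieved report would be passed to the language model as additional context; how the two similarities are actually weighted or normalized in the paper is not stated in the abstract.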

Paper Structure

This paper contains 33 sections, 9 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Disease-Aware Semantic Tokens and retrieved examples transform image-only LLM reports from vague to highly specific.
  • Figure 2: Overview of the proposed two-stage framework. Stage 1 jointly trains a VMamba image encoder and a text encoder to learn disease-aware semantic tokens (DASTs), aligning visual and textual features through classification and contrastive losses. Stage 2 fuses the learned DASTs with visual features, retrieves a similar study via dual-modal similarity retrieval, and feeds these cues into a large language model to generate the final radiology report.
  • Figure 3: Visualization of generated reports on a sample X-ray.
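The Figure 2 caption mentions aligning visual and textual features through a contrastive loss. As a rough illustration, here is a sketch of a symmetric, CLIP-style contrastive (InfoNCE) objective over paired image and text embeddings. The exact loss used in the paper is not given in this section, so the symmetric form, the `temperature` value, and all function names are assumptions.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image contrastive losses.

    Pair i in `image_embs` is assumed to match pair i in `text_embs`;
    all other combinations in the batch act as negatives.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in texts] for img in images]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits_transposed))


# Correctly paired embeddings should score a lower loss than swapped ones.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

In a Stage 1 setup like the one described, a term of this kind would be combined with the multi-label classification loss; the relative weighting is not specified here.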