A Disease-Aware Dual-Stage Framework for Chest X-ray Report Generation
Puzhen Wu, Hexin Dong, Yi Lin, Yihao Ding, Yifan Peng
TL;DR
The paper addresses the challenge of automatically generating clinically accurate chest X-ray reports by introducing a disease-aware dual-stage framework. Stage 1 learns Disease-Aware Semantic Tokens (DASTs) through cross-attention and multi-label classification, and aligns vision and language representations via contrastive learning. Stage 2 fuses disease semantics with visual features through a Disease-Visual Attention Fusion (DVAF) module and retrieves similar cases through Dual-Modal Similarity Retrieval (DMSR), which provides contextual guidance during report generation. The approach reports state-of-the-art results on CheXpert Plus, IU X-ray, and MIMIC-CXR, with improvements in both linguistic quality and clinical accuracy. By combining explicit disease guidance with retrieval-based context, the work targets automated radiology reporting, which has the potential to reduce radiologists' workload and shorten patient wait times.
Abstract
Radiology report generation from chest X-rays is an important task in artificial intelligence with the potential to greatly reduce radiologists' workload and shorten patient wait times. Despite recent advances, existing approaches often lack sufficient disease-awareness in visual representations and adequate vision-language alignment to meet the specialized requirements of medical image analysis. As a result, these models usually overlook critical pathological features on chest X-rays and struggle to generate clinically accurate reports. To address these limitations, we propose a novel dual-stage disease-aware framework for chest X-ray report generation. In Stage 1, our model learns Disease-Aware Semantic Tokens (DASTs) corresponding to specific pathology categories through cross-attention mechanisms and multi-label classification, while simultaneously aligning vision and language representations via contrastive learning. In Stage 2, we introduce a Disease-Visual Attention Fusion (DVAF) module to integrate disease-aware representations with visual features, along with a Dual-Modal Similarity Retrieval (DMSR) mechanism that combines visual and disease-specific similarities to retrieve relevant exemplars, providing contextual guidance during report generation. Extensive experiments on benchmark datasets (i.e., CheXpert Plus, IU X-ray, and MIMIC-CXR) demonstrate that our disease-aware framework achieves state-of-the-art performance in chest X-ray report generation, with significant improvements in clinical accuracy and linguistic quality.
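As a rough illustration of the idea behind DMSR, the sketch below ranks stored cases by a weighted sum of a visual similarity and a disease-specific similarity and returns the reports of the top-ranked cases. The abstract does not specify how the two similarities are computed or combined, so the use of cosine similarity, the linear weighting, the `alpha` parameter, the `retrieve` function, and the toy two-dimensional embeddings are all assumptions made for this example, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_visual, query_disease, bank, alpha=0.5, k=2):
    """Return the reports of the k bank cases with the highest combined score.

    Assumption: the combined score is alpha * visual_sim + (1 - alpha) * disease_sim.
    The paper's actual combination rule is not given in the abstract.
    """
    scored = []
    for case in bank:
        s_vis = cosine(query_visual, case["visual"])
        s_dis = cosine(query_disease, case["disease"])
        scored.append((alpha * s_vis + (1 - alpha) * s_dis, case["report"]))
    # Highest combined similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [report for _, report in scored[:k]]


# Hypothetical toy bank: each case stores a visual embedding, a disease
# embedding, and a placeholder standing in for its report text.
bank = [
    {"visual": [1.0, 0.0], "disease": [1.0, 0.0], "report": "A"},
    {"visual": [0.0, 1.0], "disease": [0.0, 1.0], "report": "B"},
    {"visual": [1.0, 0.0], "disease": [0.0, 1.0], "report": "C"},
]

top = retrieve([1.0, 0.0], [1.0, 0.0], bank, alpha=0.5, k=2)
print(top)
```

In this toy setup, case C matches the query visually but not in its disease embedding, so it ranks below case A, which matches on both, and above case B, which matches on neither. In a retrieval-augmented setting, the retrieved reports would then be supplied as context to the report generator.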
