DAMPER: A Dual-Stage Medical Report Generation Framework with Coarse-Grained MeSH Alignment and Fine-Grained Hypergraph Matching
Xiaofei Huang, Wenting Chen, Jie Liu, Qisheng Lu, Xiaoling Luo, Linlin Shen
TL;DR
Medical report generation must bridge imaging interpretation with clinically valid narrative content. DAMPER presents a dual-stage framework that first aligns chest X-ray (CXR) features with Medical Subject Headings (MeSH) terms using MeSH encoding, GAN-based MCA, and CMG to produce coarse radiological representations, then employs hypergraphs for intra- and inter-patient fine-grained alignment before decoding the final report. The approach yields superior METEOR and CE metrics on IU-Xray and MIMIC-CXR, with strong zero-shot generalization to MIMIC-ABN, demonstrating improved semantic fidelity and clinical relevance. By integrating MeSH knowledge and hypergraph-based high-order relationships, DAMPER closely mirrors the radiologist workflow and improves robustness to missing views.
Abstract
Medical report generation is crucial for clinical diagnosis and patient management, summarizing diagnoses and recommendations based on medical imaging. However, existing work often overlooks the clinical pipeline involved in report writing, where physicians typically conduct an initial quick review followed by a detailed examination. Moreover, current alignment methods may lead to misaligned relationships. To address these issues, we propose DAMPER, a dual-stage framework for medical report generation that mimics the clinical pipeline of report writing. The first stage, MeSH-Guided Coarse-Grained Alignment (MCG), aligns chest X-ray (CXR) image features with medical subject headings (MeSH) features to generate a rough keyphrase representation of the overall impression. The second stage, Hypergraph-Enhanced Fine-Grained Alignment (HFG), constructs hypergraphs for image patches and report annotations, modeling high-order relationships within each modality and performing hypergraph matching to capture semantic correlations between image regions and textual phrases. Finally, the coarse-grained visual features, generated MeSH representations, and visual hypergraph features are fed into a report decoder to produce the final medical report. Extensive experiments on public datasets demonstrate the effectiveness of DAMPER in generating comprehensive and accurate medical reports, outperforming state-of-the-art methods across various evaluation metrics.
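To make the idea of hypergraph construction and matching concrete, the following is a minimal toy sketch, not DAMPER's actual implementation. Every design choice in it is an assumption made for illustration: hyperedges are formed by grouping each node with its k nearest neighbours under cosine similarity, high-order relationships are approximated by one round of node-to-hyperedge-to-node mean aggregation, and matching is a simple argmax over cosine similarity. The function names (`knn_hyperedges`, `hypergraph_smooth`, `match`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_hyperedges(features, k):
    """Assumed construction: one hyperedge per node, holding the node
    and its k most similar neighbours."""
    edges = []
    for i, f in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        ranked = sorted(others, key=lambda j: cosine(f, features[j]), reverse=True)
        edges.append(sorted([i] + ranked[:k]))
    return edges


def hypergraph_smooth(features, edges):
    """One message-passing step: nodes -> hyperedges -> nodes, using means."""
    dim = len(features[0])
    # Hyperedge feature = mean of its member node features.
    edge_feats = [
        [sum(features[n][d] for n in e) / len(e) for d in range(dim)]
        for e in edges
    ]
    smoothed = []
    for n in range(len(features)):
        # Every node belongs to at least its own hyperedge, so this is non-empty.
        incident = [ef for e, ef in zip(edges, edge_feats) if n in e]
        smoothed.append(
            [sum(ef[d] for ef in incident) / len(incident) for d in range(dim)]
        )
    return smoothed


def match(visual, textual):
    """For each visual node, the index of the most similar textual node."""
    return [
        max(range(len(textual)), key=lambda j: cosine(v, textual[j]))
        for v in visual
    ]


# Toy 2-D "patch" and "phrase" features (made-up numbers, not real data).
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
phrases = [[1.0, 0.05], [0.05, 1.0], [0.8, 0.2], [0.2, 0.8]]

patch_edges = knn_hyperedges(patches, k=1)
phrase_edges = knn_hyperedges(phrases, k=1)
assignment = match(
    hypergraph_smooth(patches, patch_edges),
    hypergraph_smooth(phrases, phrase_edges),
)
```

In this toy setting each hyperedge groups mutually similar patches or phrases, so the smoothed node features carry group-level context before the cross-modal comparison. A learned model would replace the mean aggregation and argmax with trainable layers and a matching loss.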
