Memory-based Cross-modal Semantic Alignment Network for Radiology Report Generation
Yitian Tao, Liyan Ma, Jing Yu, Han Zhang
TL;DR
The paper addresses automatic radiology report generation, where disease-related information makes up only a small part of both the image and the report, which makes the latent relation between the two modalities hard to learn. It introduces MCSAM, a memory-based framework that initializes a long-term cross-modal memory bank via Optimal Transport, retrieves memory entries to consolidate visual and textual features, and enforces semantic consistency of the retrieved memory with a semantic alignment module (SAM) trained with a contrastive objective. The report generator then uses the semantic embeddings from SAM together with learnable prompts to produce fluent, clinically accurate reports, achieving state-of-the-art results on MIMIC-CXR and strong performance on IU-Xray. This approach improves interpretability, reduces data bias, and offers a scalable path toward clinically robust radiology report generation. The method is formalized with quantities such as $N^m$ and $D_M$, and it is trained with the combined loss $L = L_{gen} + L_{align}$.
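To make the combined objective $L = L_{gen} + L_{align}$ concrete, the following is a minimal NumPy sketch of one common contrastive formulation (a symmetric InfoNCE loss over matched image/report pairs). This summary does not give the exact form of $L_{align}$, so the function names `alignment_loss` and `total_loss`, the temperature value, and the symmetric formulation are all assumptions made for illustration only.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def alignment_loss(visual, textual, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed stand-in for L_align).

    visual, textual: arrays of shape (batch, dim); row i of each array
    is assumed to come from the same image/report pair.
    """
    v = l2_normalize(visual)
    t = l2_normalize(textual)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(v))                 # matched pairs sit on the diagonal
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_v2t + loss_t2v)

def total_loss(l_gen, l_align):
    """Combined objective L = L_gen + L_align, as stated in the summary."""
    return l_gen + l_align
```

In this sketch, `l_gen` would be the report generator's usual token-level loss, computed elsewhere; only the way the two terms are summed is taken from the text above.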
Abstract
Generating radiology reports automatically reduces the workload of radiologists and aids the diagnosis of specific diseases. Many existing methods treat this task as a modality transfer process. However, since the key disease-related information accounts for only a small proportion of both the image and the report, it is hard for the model to learn the latent relation between a radiology image and its report, and the model therefore fails to generate fluent and accurate reports. To tackle this problem, we propose a memory-based cross-modal semantic alignment model (MCSAM) that follows an encoder-decoder paradigm. MCSAM includes a well-initialized long-term clinical memory bank that learns disease-related representations and prior knowledge; the different modalities retrieve from this bank and use the retrieved memory to perform feature consolidation. To ensure the semantic consistency of the retrieved cross-modal prior knowledge, we propose a cross-modal semantic alignment module (SAM). SAM also generates semantic visual feature embeddings, which are added to the decoder and benefit report generation. More importantly, to memorize the state and additional information while the decoder generates a report, we use learnable memory tokens, which can be seen as prompts. Extensive experiments demonstrate the promising performance of the proposed method, which achieves state-of-the-art performance on the MIMIC-CXR dataset.
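The retrieval and feature consolidation step described above can be pictured with a short NumPy sketch. The abstract does not specify the retrieval mechanism, so this example assumes scaled dot-product attention over the memory slots followed by a residual addition; the function name `retrieve_and_consolidate` and the shapes used are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def retrieve_and_consolidate(features, memory):
    """Query a shared memory bank and fuse the result into the features.

    features: (n_tokens, dim) visual or textual features (the queries).
    memory:   (n_slots, dim) memory bank shared by both modalities.
    Returns the consolidated features and the attention weights.
    Assumption: attention-based retrieval plus a residual sum; the paper's
    actual retrieval rule may differ.
    """
    dim = features.shape[-1]
    scores = features @ memory.T / np.sqrt(dim)   # (n_tokens, n_slots)
    weights = softmax(scores, axis=-1)            # each row sums to 1
    retrieved = weights @ memory                  # (n_tokens, dim)
    return features + retrieved, weights

# Both modalities query the same bank, so they draw on the same prior knowledge.
rng = np.random.default_rng(0)
memory_bank = rng.normal(size=(16, 32))           # hypothetical: 16 slots, dim 32
visual_tokens = rng.normal(size=(49, 32))         # e.g. a 7x7 grid of patch features
text_tokens = rng.normal(size=(20, 32))
visual_out, visual_w = retrieve_and_consolidate(visual_tokens, memory_bank)
text_out, text_w = retrieve_and_consolidate(text_tokens, memory_bank)
```

Sharing one bank between the two modalities is what lets an alignment module compare what each modality retrieved; in a trained model the bank would be a learned parameter rather than random values.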
