Memory-based Cross-modal Semantic Alignment Network for Radiology Report Generation

Yitian Tao, Liyan Ma, Jing Yu, Han Zhang

TL;DR

The paper tackles automatic radiology report generation by addressing cross‑modal gaps: disease‑related information is sparse across image and text, hindering learning of their latent relationships. It introduces MCSAM, a memory‑based framework that initializes a long‑term cross‑modal memory bank via Optimal Transport, retrieves memory to consolidate visual/text features, and enforces semantic consistency with SAM through a contrastive objective. The report generator then leverages semantic embeddings and learnable prompts to produce fluent, clinically accurate reports, achieving state‑of‑the‑art results on MIMIC‑CXR and strong performance on IU‑Xray. This approach improves interpretability, reduces data bias, and offers a scalable path toward clinically robust radiology report generation. Mathematical constructs such as $N^m$, $D_M$, and the loss $L = L_{gen} + L_{align}$ anchor the method’s formalization and training dynamics.
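
The combined objective $L = L_{gen} + L_{align}$ can be illustrated with a small sketch. This summary does not give the exact form of $L_{align}$, so the symmetric InfoNCE-style loss below is an assumed stand-in for "a contrastive objective", and the names `alignment_loss`, `total_loss`, and the `temperature` value are hypothetical rather than taken from the paper.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def alignment_loss(visual, textual, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumption, not the paper's formula).

    visual, textual: (batch, dim) embeddings; row i of each forms a matched pair.
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = textual / np.linalg.norm(textual, axis=1, keepdims=True)
    logits = v @ t.T / temperature            # (batch, batch) cosine similarities
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_v2t + loss_t2v)

def total_loss(l_gen, l_align):
    # L = L_gen + L_align, as stated in the summary (no weighting term assumed).
    return l_gen + l_align
```

With matched pairs the alignment term is close to zero, and it grows when the pairing is shuffled, which is the behaviour a contrastive consistency term is meant to have.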

Abstract

Automatically generating radiology reports reduces radiologists' workload and aids the diagnosis of specific diseases. Many existing methods treat this task as a modality transfer process. However, because the key disease-related information accounts for only a small proportion of both the image and the report, it is hard for a model to learn the latent relation between a radiology image and its report, and the model therefore fails to generate fluent and accurate reports. To tackle this problem, we propose a memory-based cross-modal semantic alignment model (MCSAM) that follows an encoder-decoder paradigm. MCSAM includes a well-initialized long-term clinical memory bank that learns disease-related representations and prior knowledge; each modality retrieves from this bank and uses the retrieved memory to perform feature consolidation. To ensure the semantic consistency of the retrieved cross-modal prior knowledge, we propose a cross-modal semantic alignment module (SAM). SAM also generates semantic visual feature embeddings, which can be added to the decoder to benefit report generation. Moreover, to memorize the state and additional information while the decoder generates reports, we use learnable memory tokens, which can be seen as prompts. Extensive experiments demonstrate the promising performance of the proposed method, which achieves state-of-the-art results on the MIMIC-CXR dataset.
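
The retrieve-then-consolidate step described in the abstract can be sketched as attention over a bank of memory vectors. This is a minimal illustration under stated assumptions: scaled dot-product attention for retrieval and a residual sum for consolidation are guesses at one plausible design, not details confirmed by the abstract, and `retrieve_memory` and `consolidate` are hypothetical names.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def retrieve_memory(features, memory):
    """Attend over the memory bank (assumed scaled dot-product attention).

    features: (n_tokens, dim) visual or textual features used as queries.
    memory:   (n_slots, dim) shared memory bank.
    Returns the retrieved memory (n_tokens, dim) and the attention weights
    (n_tokens, n_slots), i.e. each slot's contribution to each query.
    """
    dim = memory.shape[1]
    scores = features @ memory.T / np.sqrt(dim)
    weights = softmax(scores, axis=1)
    retrieved = weights @ memory
    return retrieved, weights

def consolidate(features, retrieved):
    # Assumed residual fusion of the original features with retrieved memory.
    return features + retrieved
```

Because both modalities would query the same bank, the attention weights give a direct view of which memory vectors contribute to the prior knowledge for a given image or report.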

Paper Structure (16 sections, 22 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Illustration of MCSAM, which can be divided into three parts: a cross-modal memory bank, a cross-modal semantic alignment module (SAM), and a report generator. The cross-modal memory bank learns disease-related representations and prior knowledge; each modality retrieves from it and uses the retrieved memory to perform feature consolidation. SAM ensures the semantic consistency of the retrieved cross-modal prior knowledge. The report generator generates reports based on the retrieved memory and learnable prompts.
  • Figure 2: The report generator consists of a visual encoder and a report decoder with learnable prompts.
  • Figure 3: An example illustrating the quality of a report generated by our model. The model generates a fluent report with high factual completeness. For example, it correctly describes "tracheostomy tube is in standard position", "heart size remains moderately enlarged" and "no pleural effusion or pneumothorax is present".
  • Figure 4: ROUGE-L and BLEU-4 scores during training. The blue and red lines denote the scores of our method (with the OT-initialized memory bank) and the baseline method, respectively.
  • Figure 5: Visualization of the memory retrieval process. The blocks represent the memory vectors in the memory bank, the numbers denote the contribution of specific memory vectors to the construction of prior knowledge, and the red arrows and black dotted arrows denote the retrieval process of MCSAM and BASE, respectively.
  • ...and 3 more figures