Masked Contrastive Reconstruction for Cross-modal Medical Image-Report Retrieval

Zeqiang Wei, Kai Jin, Xiuzhuang Zhou

TL;DR

This work tackles cross-modal medical image-report retrieval by addressing modality heterogeneity with a Masked Contrastive Reconstruction (MCR) framework that uses masked data for both contrastive learning and reconstruction, reducing task interference and computational demands. A dedicated modality alignment strategy, Mapping before Aggregation (MbA), maps features into a common space prior to aggregation to preserve fine-grained information and improve alignment. Experiments on the MIMIC-CXR dataset show state-of-the-art retrieval performance and substantial efficiency gains, with memory usage and training time dramatically reduced. The approach enables scalable, high-fidelity medical cross-modal retrieval and can benefit downstream generation and diagnosis tasks.

Abstract

The cross-modal medical image-report retrieval task plays a significant role in clinical diagnosis and various medical generative tasks. Eliminating heterogeneity between modalities to enhance semantic consistency is the key challenge of this task. Current Vision-Language Pretraining (VLP) models, which use cross-modal contrastive learning and masked reconstruction as joint training tasks, can effectively enhance cross-modal retrieval performance. Such frameworks typically employ dual-stream inputs, using unmasked data for cross-modal contrastive learning and masked data for reconstruction. However, the significant differences between the inputs of the two proxy tasks cause task competition and information interference, which limits the effectiveness of representation learning for both intra-modal and cross-modal features. In this paper, we propose an efficient VLP framework named Masked Contrastive and Reconstruction (MCR), which takes masked data as the sole input for both tasks. This strengthens the connection between the tasks, reducing information interference and competition between them, while also substantially decreasing the required GPU memory and training time. Moreover, we introduce a new modality alignment strategy named Mapping before Aggregation (MbA). Unlike previous methods, MbA maps the different modalities to a common feature space before conducting local feature aggregation, thereby reducing the loss of fine-grained semantic information needed for modality alignment. Qualitative and quantitative experiments conducted on the MIMIC-CXR dataset validate the effectiveness of our approach, demonstrating state-of-the-art performance in medical cross-modal retrieval tasks.
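
To make the ordering idea behind MbA concrete, the toy sketch below compares mapping each local feature before pooling with pooling first and mapping afterwards. It is only an illustration under stated assumptions: the abstract does not specify the mapping or the aggregation operator, so the ReLU-activated linear layer (`relu_linear`), the mean pooling (`mean_pool`), and the 2x2 weights are all hypothetical choices, not the authors' implementation.

```python
def relu_linear(x, weight):
    """Hypothetical mapping into the common space: linear layer + ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weight]


def mean_pool(tokens):
    """Hypothetical aggregation: average the local (patch/token) features."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def map_before_aggregate(tokens, weight):
    # MbA order: map every local feature first, then pool the mapped features.
    return mean_pool([relu_linear(t, weight) for t in tokens])


def aggregate_before_map(tokens, weight):
    # Reverse order: pool the local features first, then map the pooled vector.
    return relu_linear(mean_pool(tokens), weight)


# Two toy local features that cancel each other when averaged.
tokens = [[1.0, 0.0], [-1.0, 0.0]]
weight = [[1.0, 0.0], [-1.0, 0.0]]  # toy projection, assumed for illustration

mba = map_before_aggregate(tokens, weight)
abm = aggregate_before_map(tokens, weight)
print("map then aggregate:", mba)
print("aggregate then map:", abm)
```

In this toy case, pooling first averages the two local features to zero, so nothing is left for the mapping to act on, whereas mapping first keeps a nonzero contribution from each. With a purely linear mapping and mean pooling the two orders would give identical results, so the difference shown here depends on the assumed nonlinearity.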

Paper Structure (15 sections, 5 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: The VLP framework (a) takes as input both masked and unmasked data, and uses unmasked features for cross-modal contrastive learning and masked features for reconstruction. Our MCR framework (b) takes as input masked data alone, and uses masked features for both cross-modal contrastive learning and masked reconstruction.
  • Figure 2: The overall pipeline of the MCR framework. It takes as input masked chest X-ray images and masked reports, with mask rates of 50% and 25%, respectively. Masked image features $\hat{f}^v$ and masked report features $\hat{f}^r$ are extracted using the image encoder $E_v$ and text encoder $E_r$, respectively. Subsequently, $\hat{f}^v$ and $\hat{f}^r$ are mapped into a common semantic space using $F_{v \to s}$ and $F_{r \to s}$, followed by feature aggregation to obtain modality-aligned features $S^v$ and $S^r$. Additionally, $\hat{f}^v$ and $\hat{f}^r$ are used to reconstruct the original inputs by $D_v$ and $D_r$, respectively. $\mathcal{L}_{mim}$ and $\mathcal{L}_{mrm}$ denote the masked reconstruction losses for the chest X-ray image and the corresponding report, respectively, and $\mathcal{L}_{vrc}$ denotes the cross-modal consistency loss.
  • Figure 3: The Top-K (K $\in$ [1, 10]) retrieval results in terms of NLG metrics (BLEU-4, METEOR, ROUGE-L, and CIDEr) for different cross-modal retrieval methods. Dashed and solid lines denote the subtask I $\to$ R and the subtask R $\to$ I, respectively.
  • Figure 4: Visualization of the modality gap for our MCR with AbM (a) and MbA (b). The first two principal components of the validation set data are shown using t-SNE, where points in red and blue denote the embedded chest X-ray images and corresponding reports, respectively.
  • Figure 5: The Top-3 retrieval results using different alignment strategies. Text in yellow represents the medical semantics contained in the ground truth, text in green denotes medical semantic content similar to the query sample, and text in pink represents content irrelevant to the query sample.
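
The single-stream input described in the Figure 2 caption can be sketched roughly as follows: each modality is masked once (50% of image patches, 25% of report tokens), and the same masked input would then feed both the contrastive and the reconstruction objectives. This is a minimal sketch under assumptions. The helper names (`random_mask`, `symmetric_info_nce`), the `[MASK]` placeholder, the temperature, and the use of a symmetric InfoNCE loss as a stand-in for $\mathcal{L}_{vrc}$ are illustrative choices that the captions do not confirm, and the encoders and decoders are omitted entirely.

```python
import math
import random


def random_mask(tokens, mask_rate, rng, mask_token="[MASK]"):
    """Replace a fixed fraction of positions with a placeholder (assumed scheme)."""
    n_mask = int(len(tokens) * mask_rate)
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    return [mask_token if i in masked_idx else t for i, t in enumerate(tokens)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def symmetric_info_nce(img_feats, rep_feats, temperature=0.1):
    """Symmetric InfoNCE over a batch; pair i of images matches pair i of reports."""
    n = len(img_feats)
    sim = [[cosine(v, r) / temperature for r in rep_feats] for v in img_feats]

    def row_loss(logits, target):
        # Cross-entropy of one row, computed with a stable log-sum-exp.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        return log_z - logits[target]

    i2r = sum(row_loss(sim[i], i) for i in range(n)) / n
    r2i = sum(row_loss([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (i2r + r2i)


rng = random.Random(0)
patches = [f"patch{i}" for i in range(16)]  # placeholder image patches
words = [f"w{i}" for i in range(8)]         # placeholder report tokens

# One masked view per modality; in a single-stream setup this same view
# would be passed to both the contrastive and the reconstruction branch.
masked_patches = random_mask(patches, 0.50, rng)
masked_words = random_mask(words, 0.25, rng)

# Toy aggregated features standing in for S^v and S^r.
img = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = symmetric_info_nce(img, [[1.0, 0.0], [0.0, 1.0]])
loss_shuffled = symmetric_info_nce(img, [[0.0, 1.0], [1.0, 0.0]])
print(masked_patches.count("[MASK]"), masked_words.count("[MASK]"))
print(loss_aligned < loss_shuffled)
```

The point of the sketch is only that a single masked view per modality is produced, so no second unmasked forward pass is needed for the contrastive term; the actual masking scheme and loss formulation used by MCR may differ.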