MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training

Biao Wu; Yutong Xie; Zeyu Zhang; Minh Hieu Phan; Qi Chen; Ling Chen; Qi Wu

TL;DR

MMCLIP tackles two core medical VLP challenges: scarce medical data, which makes key pathological features hard to reconstruct, and underutilization of paired and unpaired data in combination. It introduces AttMIM to perform attention-guided masking on images and EntMLM to mask medically relevant entities in reports, both guided by cross-modal interactions and disease prompts; together with standard contrastive alignment, these enable effective learning from paired and unpaired data. Pretrained on MIMIC-CXR and PadChest, MMCLIP achieves state-of-the-art zero-shot and fine-tuning performance across five medical datasets, demonstrating strong generalization and data efficiency. The work offers a practical framework for scalable medical VLP that can leverage unpaired data and disease priors to improve diagnostic representation learning.

Abstract

Vision-and-language pretraining (VLP) in the medical field utilizes contrastive learning on image-text pairs to achieve effective transfer across tasks. Yet, current VLP approaches with the masked modeling strategy face two challenges when applied to the medical domain. First, current models struggle to accurately reconstruct key pathological features due to the scarcity of medical data. Second, most methods adopt either paired image-text data or image-only data, failing to exploit the combination of paired and unpaired data. To this end, this paper proposes the MMCLIP (Masked Medical Contrastive Language-Image Pre-Training) framework, which enhances pathological learning and uses unpaired data to improve feature learning. First, we introduce the attention-masked image modeling (AttMIM) module and the entity-driven masked language modeling (EntMLM) module, which learn to reconstruct pathological visual and textual tokens via multi-modal feature interaction, thus yielding medically enhanced features. The AttMIM module masks a portion of the image features that are highly responsive to textual features. This allows MMCLIP to reconstruct highly similar medical images more efficiently. Second, MMCLIP capitalizes on unpaired data to enhance multimodal learning by introducing disease-kind prompts. Experimental results show that MMCLIP achieves state-of-the-art (SOTA) zero-shot and fine-tuning classification performance on five datasets. Our code will be available at https://github.com/AIGeeksGroup/MMCLIP.
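
The contrastive learning on image-text pairs mentioned in the abstract can be illustrated with a generic CLIP-style symmetric loss. This is a minimal sketch under stated assumptions, not MMCLIP's actual objective: the function names, the toy embeddings, and the temperature of 0.07 are illustrative choices that do not come from the paper, and the real model would compute embeddings with its image and text encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (leave all-zero vectors unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    Assumption: image_embs[i] and text_embs[i] form a matched pair, and
    every other combination in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity matrix scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch of two pairs: matched pairs score a much lower loss than
# the same embeddings with the reports swapped.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                [[0.0, 1.0], [1.0, 0.0]])
```

Averaging both directions keeps the loss symmetric, so neither the image side nor the report side dominates the alignment signal.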

Paper Structure (30 sections, 17 equations, 8 figures, 4 tables)

Introduction
Related Works
Medical Vision Language Pretraining
Mask Image or Language Modelling
Attention Masked modelling
Methodology
Image and Text Encoders
Image encoder
Text encoder
Attention-masked Image Modeling Module
Attentions extraction
Attention-based mask generation and blending.
Entity-driven Masked Language Modeling Module
Objective Functions
Objective function for image-report alignment.
...and 15 more sections

Figures (8)

Figure 1: Modules A and B illustrate the distinctions between the existing VLP training frameworks and our proposed approach. Methods in module A only use the random mask strategy, and the reconstruction process does not involve multi-modal interaction. Module B shows that our method takes advantage of multi-modal interaction and uses an attention mechanism to guide the mask.
Figure 2: Our MMCLIP framework builds upon CLIP, integrating MIM and MLM modules, with a redesigned masking strategy to enhance model representation. The key contributions include three simple yet effective designs: generating masks through feature interaction, fusing masks to augment adaptability, and refining the text masking strategy with Medical Entity Recognition. Incorporating these designs into our multimodal pre-training framework significantly boosts the model's zero-shot performance.
Figure 3: The working mechanism of function M: The features obtained from different text data correspond to image features through the Cross Attention layer, querying the corresponding activated features to identify different regions of interest. These regions of interest are then combined, and a 25% subset is randomly selected to obtain the final Mask result. Unlike MediM, which only uses dot products to activate cross-modal response features, MMCLIP employs cross-attention for deeper feature activation. This allows the model to induce more precise key areas from text features.
Figure 4: Different masking strategies focus the mask on different regions. Masks that are specifically tailored to the pathological characteristics of the lesion area are more effective than those applied randomly.
Figure 5: Illustration of entity-driven mask generation, which combines the CausalMask with the mask guided by the medical NER to generate the final result.
...and 3 more figures
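
The mask generation described in the Figure 3 caption (text features attend over image features, the resulting regions of interest are combined, and a 25% subset is randomly selected) can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: the function name `attention_guided_mask`, the `top_k` cutoff that defines a region of interest, and the reading of "25% subset" as 25% of the combined candidate patches are all assumptions, and the learned query/key projections of a real cross-attention layer are omitted.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_guided_mask(text_feats, patch_feats, top_k=4,
                          keep_ratio=0.25, seed=0):
    """Return sorted indices of image patches to mask.

    Assumptions (not from the paper): each text feature acts directly as
    an attention query over the patch features, the top_k most attended
    patches per query form its region of interest, and keep_ratio of the
    combined regions is sampled as the final mask.
    """
    scale = 1.0 / math.sqrt(len(patch_feats[0]))
    candidates = set()
    for query in text_feats:
        # Scaled dot-product attention of one text query over all patches.
        weights = softmax([dot(query, key) * scale for key in patch_feats])
        ranked = sorted(range(len(weights)),
                        key=lambda i: weights[i], reverse=True)
        candidates.update(ranked[:top_k])  # combine regions of interest
    rng = random.Random(seed)
    n_mask = max(1, round(keep_ratio * len(candidates)))
    return sorted(rng.sample(sorted(candidates), n_mask))

# Toy example: 16 patches; patches 0-3 respond to the first text query,
# patches 4-7 to the second, and the rest respond to neither.
patches = [[5.0, 0.0]] * 4 + [[0.0, 5.0]] * 4 + [[0.0, 0.0]] * 8
texts = [[1.0, 0.0], [0.0, 1.0]]
mask = attention_guided_mask(texts, patches)
```

Sampling from the combined candidates, rather than masking all of them, leaves part of each text-relevant region visible as context for reconstruction; whether the paper samples in exactly this way is not stated in the caption.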
