MLIP: Medical Language-Image Pre-training with Masked Local Representation Learning

Jiarun Liu, Hong-Yu Zhou, Cheng Li, Weijian Huang, Hao Yang, Yong Liang, Shanshan Wang

TL;DR

MLIP tackles data inefficiency and fine-grained local alignment in medical language–image pre-training by combining two components: masked contrastive learning guided by a semantic integrity estimator, and a sentence–patch matching mechanism based on optimal transport. The framework uses a ViT-B image encoder and a Bio_ClinicalBERT text encoder to jointly learn patch- and sentence-level representations, and includes a masked image prediction task to preserve semantics. Empirical results on MIMIC-CXR show MLIP achieves state-of-the-art zero-/few-shot classification and few-shot segmentation across RSNA and SIIM, with notable gains from dense local supervision and semantic-aware weighting. The work demonstrates a practical, data-efficient approach to medical language–image pre-training, with potential for broader clinical applications and modalities.

Abstract

Existing contrastive language-image pre-training aims to learn a joint representation by matching abundant image-text pairs. However, the number of image-text pairs in medical datasets is usually orders of magnitude smaller than that in natural datasets. Besides, medical image-text pairs often involve numerous complex fine-grained correspondences. This paper aims to enhance the data efficiency by introducing multiple-to-multiple local relationship modeling to capture denser supervisions. More specifically, we propose a Medical Language-Image Pre-training (MLIP) framework, which exploits the limited image-text medical data more efficiently through patch-sentence matching. Furthermore, we introduce a masked contrastive learning strategy with semantic integrity estimation to reduce redundancy in images while preserving the underlying semantics. Our evaluation results show that MLIP outperforms previous work in zero/few-shot classification and few-shot segmentation tasks by a large margin.
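
To make the idea of optimal-transport-based sentence–patch matching more concrete, the following is a minimal sketch of generic entropic optimal transport (Sinkhorn iterations) between a handful of sentence embeddings and patch embeddings. It is not the authors' implementation: the cosine cost, the uniform marginals, the regularization strength `eps`, the iteration count, and the function names `cosine_cost` and `sinkhorn_plan` are all assumptions chosen for illustration, and the random embeddings stand in for real encoder outputs.

```python
import numpy as np


def cosine_cost(sent_emb, patch_emb):
    """Cost matrix (n_sentences, n_patches): 1 - cosine similarity."""
    s = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    return 1.0 - s @ p.T


def sinkhorn_plan(cost, eps=0.2, n_iters=500):
    """Entropic OT plan between uniform marginals (an assumption of this sketch)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass assigned to each sentence
    b = np.full(m, 1.0 / m)   # mass assigned to each patch
    K = np.exp(-cost / eps)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)       # rescale rows toward marginal a
        v = b / (K.T @ u)     # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]


# Hypothetical stand-ins for encoder outputs: 3 sentences, 8 patches, 16 dims.
rng = np.random.default_rng(0)
sentences = rng.normal(size=(3, 16))
patches = rng.normal(size=(8, 16))

plan = sinkhorn_plan(cosine_cost(sentences, patches))
# plan[i, j] is the soft amount of sentence i matched to patch j.
```

Each row of `plan` spreads one sentence over several patches and each column spreads one patch over several sentences, which is one way to express the multiple-to-multiple correspondences the abstract describes. How MLIP defines its cost and marginals, and how the plan enters the training loss, is specified in the paper itself rather than here.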

Paper Structure (13 sections, 10 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Complex multiple-to-multiple correspondences between medical images and text. An image region can correspond to multiple disease entities, while one disease entity could relate to multiple image regions simultaneously.
  • Figure 2: The framework of MLIP.
  • Figure 3: Visualization of (a) learned $\phi$ and (b) heatmap for a given prompt. The black box in (b) is the corresponding ground-truth. We transform $\phi$ into a square for display.