MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning

Zhe Li, Laurence T. Yang, Bocheng Ren, Xin Nie, Zhangyang Gao, Cheng Tan, Stan Z. Li

TL;DR

MLIP addresses the scarcity of annotated medical data with unsupervised pre-training that pairs radiology reports with images through multi-granularity image-text contrastive learning. It introduces a divergence encoder for data augmentation and uses domain knowledge from UMLS to guide local and category-level alignment, improving generalization across object detection, segmentation, and classification. The approach combines global image-text contrastive learning, local token-knowledge-patch alignment, and knowledge-guided prototype clustering, along with image-text matching and text-swapping proxy tasks, and reports state-of-the-art results, especially in zero-shot and low-data regimes. The framework highlights the practical value of knowledge-guided multimodal pre-training for robust medical representation learning and downstream AI-assisted diagnostics.

Abstract

The scarcity of annotated data has sparked significant interest in unsupervised pre-training methods that leverage medical reports as auxiliary signals for medical visual representation learning. However, existing research overlooks the multi-granularity nature of medical visual representation and lacks suitable contrastive learning techniques to improve the models' generalizability across different granularities, leading to the underutilization of image-text information. To address this, we propose MLIP, a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning. Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge. Experimental evaluations reveal the efficacy of our model in enhancing transfer performance for tasks such as image classification, object detection, and semantic segmentation. Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
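
To make the global image-text contrastive objective concrete, the following is a minimal NumPy sketch of a symmetric InfoNCE-style loss over a batch of paired image and report embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the use of cosine similarity are assumptions, and the divergence encoder and knowledge-guided terms of MLIP are not modeled here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def global_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair.

    The temperature value is an assumption, not taken from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature          # (batch, batch) similarity matrix
    diag = np.arange(logits.shape[0])           # matched pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy data: text embeddings are noisy copies of the image embeddings.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
text_emb = image_emb + 0.01 * rng.normal(size=(4, 8))

aligned = global_contrastive_loss(image_emb, text_emb)
shuffled = global_contrastive_loss(image_emb, text_emb[::-1])  # every pair mismatched
print(f"aligned: {aligned:.4f}  shuffled: {shuffled:.4f}")
```

The loss is low when each image is most similar to its own report and high when pairs are mismatched. Note that this plain formulation treats every other report in the batch as a negative, including reports that describe the same finding.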

Paper Structure

This paper contains 24 sections, 21 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Illustration of false negatives in medical image-text pairs. Lower left: conventional approaches treat false negative samples as ordinary negatives and keep them distant from positive samples. Lower right: our proposed method distinguishes false negatives from negatives, effectively bringing them closer to positives.
  • Figure 2: Our model architecture employs global, local, and category-level image-text contrastive learning. Given medical images and reports as inputs, we extract global features and local features for each modality using image and text encoders. We leverage global features for global image-text contrastive learning, while the local features are aligned with domain-specific knowledge from UMLS to achieve fine-grained image-text alignment. Through Tucker fusion and cross-modal attention mechanisms, we combine the image, text, and knowledge representations, facilitating category-level prototype contrastive learning. Furthermore, to enhance feature diversity, we introduce a divergence encoder as a data augmentation strategy, generating similar yet distinct features. This enables global contrastive learning between images and augmented text, as well as between text and augmented images.
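
The Figure 2 caption names Tucker fusion as the step that combines image, text, and knowledge representations. The sketch below shows one generic way such a fusion can be written: each modality is projected to a small rank, and the projections are contracted with a learnable core tensor. The caption does not give dimensions or wiring, so everything here is an assumption for illustration: the three-way form, the helper names `init_params` and `tucker_fuse`, the feature sizes, and the rank. The cross-modal attention step is not modeled.

```python
import numpy as np

def init_params(d_img, d_txt, d_kno, rank, d_out, seed=0):
    """Random factor matrices and core tensor (all sizes are hypothetical)."""
    rng = np.random.default_rng(seed)
    return {
        "W_img": rng.normal(size=(d_img, rank)) / np.sqrt(d_img),
        "W_txt": rng.normal(size=(d_txt, rank)) / np.sqrt(d_txt),
        "W_kno": rng.normal(size=(d_kno, rank)) / np.sqrt(d_kno),
        # Core tensor: one mode per modality plus one output mode.
        "core": rng.normal(size=(rank, rank, rank, d_out)) / rank**1.5,
    }

def tucker_fuse(image_feat, text_feat, know_feat, params):
    """Tucker-style multilinear fusion of three feature vectors per sample."""
    # Factor matrices project each modality into a low-rank space.
    p_img = image_feat @ params["W_img"]   # (batch, rank)
    p_txt = text_feat @ params["W_txt"]    # (batch, rank)
    p_kno = know_feat @ params["W_kno"]    # (batch, rank)
    # Contract the three projections with the core tensor to get the fused vector.
    return np.einsum("bi,bj,bk,ijko->bo", p_img, p_txt, p_kno, params["core"])

# Toy usage with made-up feature sizes.
rng = np.random.default_rng(1)
img = rng.normal(size=(2, 32))
txt = rng.normal(size=(2, 24))
kno = rng.normal(size=(2, 12))
params = init_params(d_img=32, d_txt=24, d_kno=12, rank=4, d_out=16)

fused = tucker_fuse(img, txt, kno, params)
print(fused.shape)
```

Because the core tensor is shared across all combinations of projected features, the fusion captures multiplicative interactions among the three inputs while keeping the parameter count far below that of a full trilinear map over the raw feature dimensions.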