MedCLIP: Contrastive Learning from Unpaired Medical Images and Text

Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun

TL;DR

MedCLIP tackles data scarcity and false negatives in medical image-text pretraining by decoupling the image and text streams and using medical knowledge to build soft supervision from semantic similarity. It expands the training data combinatorially: a knowledge extractor maps reports to UMLS entities, so images and reports that were never paired can be matched, giving up to $(n+m)\times(n+h)$ potential pairs. The semantic matching loss aligns image and text embeddings against soft targets derived from medical semantics, which reduces the noise introduced by false negatives. Experiments on four chest-imaging datasets show strong zero-shot, fine-tuned, and retrieval performance, outperforming baselines with far less pre-training data.
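
The combinatorial claim can be checked with a few lines of arithmetic. The sketch below reads n as the number of paired image-report samples, m as the number of image-only samples, and h as the number of text-only samples; that reading is an assumption, since this summary does not define the symbols. The function name `usable_pairs` is hypothetical and used only for illustration.

```python
def usable_pairs(n_paired: int, m_image_only: int, h_text_only: int) -> dict:
    """Count candidate image-text pairs with and without decoupling.

    Assumed reading of the symbols (not defined in the summary):
    n = paired samples, m = image-only samples, h = text-only samples.
    """
    # Coupled training can only use the pairs that already exist.
    coupled = n_paired
    # Decoupled training may match any image with any text.
    total_images = n_paired + m_image_only
    total_texts = n_paired + h_text_only
    decoupled = total_images * total_texts
    return {"coupled": coupled, "decoupled": decoupled}


if __name__ == "__main__":
    print(usable_pairs(3, 2, 4))
```

For example, 3 paired samples plus 2 extra images and 4 extra reports give (3+2) x (3+4) = 35 candidate pairs instead of 3.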

Abstract

Existing vision-text contrastive learning like CLIP aims to match the paired image and caption embeddings while pushing others apart, which improves representation transferability and supports zero-shot prediction. However, medical image-text datasets are orders of magnitude smaller than the general image-caption datasets collected from the internet. Moreover, previous methods encounter many false negatives, i.e., images and reports from separate patients may carry the same semantics but are wrongly treated as negatives. In this paper, we decouple images and texts for multimodal contrastive learning, thus scaling the usable training data combinatorially at low cost. We also propose to replace the InfoNCE loss with a semantic matching loss based on medical knowledge to eliminate false negatives in contrastive learning. We show that MedCLIP is a simple yet effective framework: it outperforms state-of-the-art methods on zero-shot prediction, supervised classification, and image-text retrieval. Surprisingly, we observe that with only 20K pre-training data, MedCLIP wins over the state-of-the-art method (using around 200K data). Our code is available at https://github.com/RyanWangZf/MedCLIP.

Paper Structure

This paper contains 17 sections, 9 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Zero-shot performance of MedCLIP, ConVIRT (Zhang et al., 2020), and GLoRIA (Huang et al., 2021) when using different amounts of data for pre-training. ConVIRT and GLoRIA are trained on the MIMIC-CXR (369K) and CheXpert (191K) datasets, respectively. Our method yields higher accuracy than GLoRIA using roughly $1/10$ of the pre-training data.
  • Figure 2: Demonstration of challenges in medical image-text contrastive learning. (1) Pre-training data only includes paired images and texts. However, many more image-only and text-only datasets are ignored. (2) False negatives appear. For an anchor image, previous methods treat paired texts (i.e., reports from the same patient's study) as positives and unpaired texts (i.e., reports from other patients' studies) as negatives. However, the negative texts can describe the same symptoms as the anchor texts.
  • Figure 3: The workflow of MedCLIP. The knowledge extraction module extracts medical entities from raw medical reports. Then, a semantic similarity matrix is built by comparing medical entities (from text) and raw labels (from images), which makes it possible to pair any separately sampled image and text. The extracted image and text embeddings are then trained so that their pairwise similarities match the semantic similarity matrix.
  • Figure 4: Visualization of the embeddings of CheXpert5x200 images produced by CLIP and MedCLIP. Dimensionality is reduced with t-SNE.
  • Figure 5: Visualization of the similarity distributions computed based on MedCLIP embeddings.
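
The workflow in Figure 3 can be illustrated with a small, dependency-free sketch. It assumes that image labels and text entities are encoded as multi-hot vectors over a shared vocabulary, that their cosine similarity is turned into soft targets with a softmax, and that the loss is a cross-entropy between those targets and the softmax of temperature-scaled embedding similarities. These choices, the temperature value, and all function names are illustrative assumptions rather than the authors' implementation; only the image-to-text direction is shown.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    mx = max(xs)
    log_total = mx + math.log(sum(math.exp(x - mx) for x in xs))
    return [x - log_total for x in xs]


def soft_targets(image_labels, text_entities):
    """Row i holds soft targets for image i over all texts (assumed form)."""
    rows = []
    for label_vec in image_labels:
        sims = [cosine(label_vec, ent_vec) for ent_vec in text_entities]
        rows.append([math.exp(lp) for lp in log_softmax(sims)])
    return rows


def semantic_matching_loss(image_emb, text_emb, targets, temperature=0.07):
    """Cross-entropy between soft targets and predicted image-to-text scores."""
    total = 0.0
    for i, img in enumerate(image_emb):
        logits = [cosine(img, txt) / temperature for txt in text_emb]
        log_probs = log_softmax(logits)
        total -= sum(y * lp for y, lp in zip(targets[i], log_probs))
    return total / len(image_emb)


if __name__ == "__main__":
    # Two findings in the shared vocabulary; texts 0 and 1 mention the same one.
    labels = [[1, 0]]
    entities = [[1, 0], [1, 0], [0, 1]]
    print(soft_targets(labels, entities))
```

In the demo, the two reports that mention the same finding receive equal target weight for the image, so neither is treated as a hard negative. This is the false-negative case described in Figure 2.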