UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities

Muhammad Uzair Khattak, Shahina Kunhimon, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

TL;DR

The paper tackles the data bottleneck in medical vision-language modeling by introducing UniMed, a large-scale open dataset of 5.3 million image-text pairs across six medical modalities. It pairs this with UniMed-CLIP, a unified VLM trained with a CLIP objective, in which each image from a label-only dataset is paired with multiple LLM-generated captions. Empirical results show strong zero-shot performance across 21 datasets and six modalities, as well as robust linear-probing transfer with limited data; the model outperforms generalist medical VLMs and approaches modality-specific baselines. The work emphasizes data-centric design and releases open-source resources to advance open, multi-modal medical foundation models.

Abstract

Vision-Language Models (VLMs) trained via contrastive learning have achieved notable success in natural image tasks. However, their application in the medical domain remains limited due to the scarcity of openly accessible, large-scale medical image-text datasets. Existing medical VLMs are trained either on closed-source proprietary datasets or on relatively small open-source datasets that do not generalize well. Similarly, most models remain specific to a single or a limited number of medical imaging domains, again restricting their applicability to other modalities. To address this gap, we introduce UniMed, a large-scale, open-source multi-modal medical dataset comprising over 5.3 million image-text pairs across six diverse imaging modalities: X-ray, CT, MRI, Ultrasound, Pathology, and Fundus. UniMed is developed using a data-collection framework that leverages Large Language Models (LLMs) to transform modality-specific classification datasets into image-text formats while incorporating existing image-text data from the medical domain, facilitating scalable VLM pretraining. Using UniMed, we trained UniMed-CLIP, a unified VLM for six modalities that significantly outperforms existing generalist VLMs and matches modality-specific medical VLMs, achieving notable gains in zero-shot evaluations. For instance, UniMed-CLIP improves over BiomedCLIP (trained on proprietary data) by an absolute gain of +12.61, averaged over 21 datasets, while using 3x less training data. To facilitate future research, we release the UniMed dataset, training code, and models at https://github.com/mbzuai-oryx/UniMed-CLIP.

Paper Structure

This paper contains 18 sections, 4 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Zero-shot medical image recognition results. Averaged results over 21 datasets from 6 modalities: CT, MRI, US, X-ray, Histopathology, and Retinal fundus. UniMed-CLIP, trained on our open-source UniMed dataset developed using publicly available data sources, shows notable gains compared to existing medical contrastive VLMs including MedCLIP [wang2022medclip], MM-Retinal [wu2024mm], QuiltNet [ikezogwo2024quilt], BiomedCLIP [zhang2023biomedclip] and PMC-CLIP [lin2023pmc].
  • Figure 2: Overview of the UniMed dataset and the UniMed-CLIP VLM. (Left): We develop a medical pretraining dataset, UniMed, by meticulously collecting publicly available label-only (uni-modal) image datasets and image-text (multi-modal) datasets. (Middle): We utilize an LLM-in-the-loop framework to convert label-only datasets into pseudo-image-text pairs where each image is paired with multiple captions. Both pseudo-image-text pairs and already available image-text pairs are used to create the UniMed dataset, which is (a) open-source, (b) large-scale, and (c) covers diverse medical modalities. (Right): Using the UniMed dataset, we train UniMed-CLIP within a contrastive language-image pretraining paradigm. The resulting VLM performs well in zero-shot evaluations across various medical modalities.
  • Figure 3: Public datasets used in UniMed: It comprises both general medical domain datasets with diverse modalities and modality-specific datasets. The data spans publicly available unimodal (image-label) and multimodal (image-text) formats, enabling broad applicability across various medical imaging tasks.
  • Figure 4: Label-to-Caption Generation Prompting: We perform template caption generation using an LLM, which leverages available label information (Label Info Triplet). This approach yields diverse captions that agree with the ground-truth label information and are expressed in biomedical terminology. In each training iteration, each image is paired with a randomly sampled caption from its support set.
  • Figure 5: Caption length distribution: Average length of captions (in words) of different datasets used in UniMed.
  • ...and 7 more figures
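
The per-iteration caption sampling described in the Figure 4 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the templates here are hand-written stand-ins for the LLM-generated ones, and the triplet fields (`modality`, `organ`, `label`) and the function names `build_support_set` and `sample_caption` are hypothetical, chosen only for illustration.

```python
import random

# Hypothetical caption templates. In the paper these come from an LLM;
# the strings below are hand-written placeholders for illustration only.
TEMPLATES = [
    "{modality} image showing {label}.",
    "{modality} scan of the {organ} with findings consistent with {label}.",
    "This {organ} {modality} image indicates {label}.",
]


def build_support_set(modality, organ, label, templates=TEMPLATES):
    """Fill every template with one label-info triplet (fields are assumed)."""
    return [t.format(modality=modality, organ=organ, label=label) for t in templates]


def sample_caption(support_set, rng):
    """Pick one caption uniformly at random for the current training iteration."""
    return rng.choice(support_set)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so repeated runs pick the same captions
    support = build_support_set("X-ray", "chest", "pneumonia")
    for step in range(3):
        # Each iteration may pair the same image with a different caption.
        print(step, sample_caption(support, rng))
```

Sampling a fresh caption at every iteration, rather than fixing one caption per image, exposes the text encoder to several phrasings of the same ground-truth label, which is the stated purpose of the multi-caption support set.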