Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning

Shuvendu Roy, Yasaman Parhizkar, Franklin Ogidi, Vahid Reza Khazaie, Michael Colacci, Ali Etemad, Elham Dolatabadi, Arash Afkanpour

TL;DR

The paper benchmarks eight vision–language contrastive methods for medical multimodal representation learning, training on 2.8M image–text pairs from four datasets that span radiology, histopathology, and other medical domains. It reports three findings: general-domain representations transfer to the medical domain, particularly when the image encoder is partially frozen; unimodal training does not reliably improve multimodal medical tasks; and learning fine-grained features yields the strongest gains. The work provides a unified, large-scale evaluation framework covering 25 downstream tasks and releases code to support reproducibility and further research on medical foundation models. Overall, the findings offer practical guidance for building efficient, transferable medical vision–language models under data constraints, with implications for downstream tasks such as retrieval and VQA.

Abstract

We perform a comprehensive benchmarking of contrastive frameworks for learning multimodal representations in the medical domain. Through this study, we aim to answer the following research questions: (i) How transferable are general-domain representations to the medical domain? (ii) Is multimodal contrastive training sufficient, or does it benefit from unimodal training as well? (iii) What is the impact of feature granularity on the effectiveness of multimodal medical representation learning? To answer these questions, we investigate eight contrastive learning approaches under identical training setups. We train them on 2.8 million image-text pairs from four datasets and evaluate them on 25 downstream tasks, including classification (zero-shot and linear probing), image-to-text and text-to-image retrieval, and visual question-answering. Our findings suggest a positive answer to the first question, a negative answer to the second question, and a benefit from learning fine-grained features. Finally, we make our code publicly available.

Paper Structure

This paper contains 20 sections, 6 equations, 6 figures, and 23 tables.

Figures (6)

  • Figure 1: Illustration of eight contrastive learning approaches studied in the paper.
  • Figure 2: Samples from the datasets used in this study.
  • Figure 3: Linear-probing F1 scores for the transferability study. Here, Image Partial Freeze outperforms the baseline on 6 datasets, while Text Partial Freeze, Text Full Freeze, and Image Full Freeze outperform it on 5, 3, and 1 datasets, respectively.
  • Figure 4: Linear-probing F1 scores for the unimodal learning study. Here, Masked CL outperforms the baseline on 2 datasets.
  • Figure 5: Linear-probing F1 scores for the feature granularity study. Here, Image Captioning and Fast CL outperform the baseline on 4 datasets each.
  • ...and 1 more figures