Advancing Medical Representation Learning Through High-Quality Data

Negin Baghbanzadeh, Adibvafa Fallahpour, Yasaman Parhizkar, Franklin Ogidi, Shuvendu Roy, Sajad Ashkezari, Vahid Reza Khazaie, Michael Colacci, Ali Etemad, Arash Afkanpour, Elham Dolatabadi

TL;DR

This work examines how data quality affects medical vision-language learning by introducing Open-PMC, a high-quality dataset derived from PubMed Central with 2.2M image-text pairs comprising medical subfigures, subcaptions, in-text reference summaries, and modality annotations. The authors develop a multi-stage pipeline (image decomposition, caption segmentation, and textual augmentation) combined with GPT-4o-based contextualization to produce well-aligned image-text pairs for contrastive pretraining. In extensive retrieval and zero-shot classification experiments across radiology, microscopy, and visible light photography (VLP) modalities, models trained on Open-PMC show that data quality can matter more than dataset size for representation quality, and that the learned latent structure differs from that of prior medical datasets. The work also releases open-source resources (dataset, trained models, and code) to advance multimodal medical AI, and identifies areas for improvement such as extending decomposition techniques to more modalities and implementing more robust data QA.

Abstract

Despite the growing scale of medical Vision-Language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.

Paper Structure

This paper contains 27 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: (Left) Open-PMC-17M comprises 16.7 million image-caption pairs, which undergo rigorous quality curation to produce Open-PMC, including 2.2 million image-text pairs; images are medical subfigures, and texts are captions enriched with both the actual and summarized content of in-text references. (Right) The distribution (%) of each medical image modality within Open-PMC.
  • Figure 2: Zero-shot classification F1-scores across different VL models trained on datasets of varying sizes, evaluated on downstream tasks split by image modality. Each marker represents performance on an individual task, while the solid line indicates the mean performance across all tasks. Ours indicates Open-PMC.
  • Figure 3: Comparison of representation spaces of different VL models. (Top) MMD values between representations learned from Open-PMC versus PMC-15M and Biomedica. Red dots indicate observed MMD values, and blue bars show the 99% bootstrap confidence interval of the permutation test. (Bottom) t-SNE visualizations of VL model embeddings, illustrating the structure and separation of the learned representation spaces.
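
The MMD comparison described in Figure 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the RBF kernel, the `gamma` bandwidth, the biased estimator, and the function names `rbf_kernel`, `mmd2`, and `permutation_test` are all assumptions chosen for illustration, and the summary above does not say which kernel or estimator the paper uses. The sketch computes a squared MMD between two sets of embeddings and builds a null distribution by shuffling which set each embedding belongs to.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances, clipped at 0 against round-off.
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(x, y, gamma):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )


def permutation_test(x, y, gamma, n_perm=200, seed=0):
    """Observed MMD^2, its permutation null distribution, and a p-value."""
    rng = np.random.default_rng(seed)
    observed = mmd2(x, y, gamma)
    pooled = np.vstack([x, y])
    n = len(x)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffle group membership and recompute the statistic.
        idx = rng.permutation(len(pooled))
        null[i] = mmd2(pooled[idx[:n]], pooled[idx[n:]], gamma)
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + (null >= observed).sum()) / (1 + n_perm)
    return observed, null, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Synthetic stand-ins for embeddings from two different models.
    emb_a = rng.normal(0.0, 1.0, size=(100, 8))
    emb_b = rng.normal(1.0, 1.0, size=(100, 8))
    obs, null, p = permutation_test(emb_a, emb_b, gamma=1.0 / 8, n_perm=100)
    # Percentile interval of the null distribution (99% here, as an example).
    lo, hi = np.percentile(null, [0.5, 99.5])
    print(f"observed MMD^2={obs:.4f}, null 99% interval=({lo:.4f}, {hi:.4f}), p={p:.4f}")
```

An observed value that falls well outside the interval of the null distribution suggests the two embedding sets come from different distributions, which is the kind of reading the top panel of Figure 3 supports.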