Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning
Negin Baghbanzadeh, Mohammed Saidul Islam, Sajad Ashkezari, Elham Dolatabadi, Arash Afkanpour
TL;DR
The paper tackles the misalignment and shallow context in biomedical vision-language data by constructing Open-PMC-18M, a large-scale dataset of 18 million subfigure-text pairs with subcaptions and in-text contextual summaries. It introduces a scalable pipeline using transformer-based subfigure detection trained on synthetic data, automated subcaption extraction, and context-aware text enrichment, followed by careful refinement to ensure medical relevance. Through extensive pretraining and evaluation across radiology, microscopy, and visible-light photography, the approach yields state-of-the-art retrieval, stronger zero-shot classification, and robust representations, demonstrating the value of high-fidelity supervision over mere scale. The authors also provide ablations, representation analyses, and release the dataset, models, and code to support reproducible research in biomedical vision-language modeling.
Abstract
In biomedical vision-language modeling, datasets are typically mined from scientific literature, pairing compound figures with captions that are short, context-dependent, and often only partially informative. Prior work on subfigure extraction has been limited in both dataset size and generalizability. In addition, no existing effort has incorporated rich medical context in image-text pairs. We revisit data curation as a foundational component of effective biomedical representation learning. Our data curation process integrates transformer-based subfigure detection, subcaption extraction, and contextual text enrichment derived from inline references. Our subfigure extraction model, trained on a corpus of 500,000 compound figures, achieves state-of-the-art performance on real and synthetic benchmarks. Using this process, we curate and release Open-PMC-18M, a large-scale, high-fidelity biomedical dataset comprising 18 million image-text pairs spanning radiology, microscopy, and visible-light photography. We train vision-language models on our dataset and perform extensive evaluation on 6 retrieval and 19 zero-shot classification tasks across three major modalities. The models trained on our dataset set a new state of the art in medical representation learning. We release our dataset, models, and code to support reproducible benchmarks and further study of biomedical vision-language modeling and representation learning.
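The abstract reports retrieval results but does not name the metric. As a minimal sketch, assuming a Recall@K-style image-to-text evaluation (a common choice for retrieval benchmarks, not confirmed by the text above), the computation could look like the following. The function names and the toy embeddings are hypothetical and only illustrate the idea; they are not the authors' evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recall_at_k(image_embs, text_embs, k):
    """Fraction of images whose paired text (same index) ranks in the top k.

    Assumes image_embs[i] and text_embs[i] form a matched pair.
    """
    hits = 0
    for i, img in enumerate(image_embs):
        scores = [cosine(img, txt) for txt in text_embs]
        # Rank text indices from most to least similar to this image.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(image_embs)


# Toy embeddings (hypothetical values, not drawn from the dataset).
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]

for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(image_embs, text_embs, k):.3f}")
```

In practice the embeddings would come from the image and text encoders of a pretrained vision-language model, and the same routine can be run in the text-to-image direction by swapping the two arguments.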
