Advancing Medical Representation Learning Through High-Quality Data
Negin Baghbanzadeh, Adibvafa Fallahpour, Yasaman Parhizkar, Franklin Ogidi, Shuvendu Roy, Sajad Ashkezari, Vahid Reza Khazaie, Michael Colacci, Ali Etemad, Arash Afkanpour, Elham Dolatabadi
TL;DR
This work examines how data quality affects medical vision-language learning. It introduces Open-PMC, a high-quality dataset derived from PubMed Central with 2.2M image-text pairs that include medical subfigures, subcaptions, in-text reference summaries, and modality annotations. The authors build a multi-stage pipeline (image decomposition, caption segmentation, and textual augmentation) and combine it with GPT-4o-based contextualization to produce well-aligned image-text pairs for contrastive pretraining. In retrieval and zero-shot classification experiments across radiology, microscopy, and visible light photography (VLP), models trained on Open-PMC show that data quality can matter more than dataset size for representation quality, and their latent structure differs from that of models trained on prior medical datasets. The dataset, trained models, and code are released as open-source resources. The work also identifies areas for improvement, such as extending decomposition techniques to more diverse modalities and implementing more robust data QA.
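To make "contrastive pretraining" on image-text pairs concrete, the following is a minimal sketch of a symmetric, CLIP-style contrastive loss in plain Python. It is not taken from the Open-PMC codebase: the function names, the temperature of 0.07, and the toy embeddings are assumptions chosen for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Pair i (image_embs[i], text_embs[i]) is treated as the only positive;
    every other pairing in the batch is a negative. The temperature value
    is an assumed default, not one reported in the source.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # Cosine similarity of every image with every caption, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image -> text: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text -> image: same idea over the columns of the logit matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[i])[i] for i in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy batch: two images with correctly paired vs. swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired captions give a much smaller loss than the swapped ones. A loss of this kind rewards well-aligned image-text pairs, which is why the pairing quality of the training data matters.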
Abstract
Despite the growing scale of medical vision-language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.
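As an illustration of how an image-text retrieval benchmark is commonly scored, here is a minimal sketch of Recall@K in plain Python. The abstract does not say which retrieval metric is used, so the choice of Recall@K, the function name, and the toy similarity matrix are assumptions.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct match appears in the top k results.

    similarity[i][j] is the score of query i against candidate j, and the
    correct candidate for query i is assumed to sit at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three image queries against three captions.
toy_similarity = [
    [0.9, 0.1, 0.0],  # query 0: correct caption ranked first
    [0.2, 0.3, 0.8],  # query 1: correct caption ranked second
    [0.1, 0.7, 0.6],  # query 2: correct caption ranked second
]
r_at_1 = recall_at_k(toy_similarity, 1)
r_at_2 = recall_at_k(toy_similarity, 2)
```

With these toy scores, one of three queries is matched at rank 1 and all three within rank 2. In practice the matrix would hold cosine similarities between image and text embeddings from the trained encoders.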
