Open-PMC-18M: A High-Fidelity Large Scale Medical Dataset for Multimodal Representation Learning

Negin Baghbanzadeh, Mohammed Saidul Islam, Sajad Ashkezari, Elham Dolatabadi, Arash Afkanpour

TL;DR

The paper tackles the misalignment and shallow context in biomedical vision-language data by constructing Open-PMC-18M, a large-scale dataset of 18 million subfigure-text pairs with subcaptions and in-text contextual summaries. It introduces a scalable pipeline using transformer-based subfigure detection trained on synthetic data, automated subcaption extraction, and context-aware text enrichment, followed by careful refinement to ensure medical relevance. Through extensive pretraining and evaluation across radiology, microscopy, and visible-light photography, the approach yields state-of-the-art retrieval, stronger zero-shot classification, and robust representations, demonstrating the value of high-fidelity supervision over mere scale. The authors also provide ablations and representation analyses, and release the dataset, models, and code to support reproducible research in biomedical vision-language modeling.

Abstract

In biomedical vision-language modeling, datasets are typically mined from scientific literature, pairing compound figures with captions that are short, context-dependent, and often partially informative. Prior work on subfigure extraction has been limited in both dataset size and generalizability. In addition, no existing effort has incorporated rich medical context in image-text pairs. We revisit data curation as a foundational component of effective biomedical representation learning. Our data curation process integrates transformer-based subfigure detection, subcaption extraction, and contextual text enrichment derived from inline references. Our subfigure extraction model, trained on a corpus of 500,000 compound figures, achieves state-of-the-art performance on real and synthetic benchmarks. Using this process, we curate and release Open-PMC-18M, a large-scale high-fidelity biomedical dataset comprising 18 million image-text pairs, spanning radiology, microscopy, and visible light photography. We train vision-language models on our dataset and perform extensive evaluation on 6 retrieval and 19 zero-shot classification tasks across three major modalities. The models trained on our dataset set new state-of-the-art results in medical representation learning. We release our dataset, models, and code to support reproducible benchmarks and further study into biomedical vision-language modeling and representation learning.

Paper Structure

This paper contains 59 sections, 2 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: (a) Compound figure with its full caption (the subcaption for subfigure 'C' is highlighted) and the related in-text reference, from the Biomedica dataset; (b) Extracted subfigure image from Open-PMC-18M, with its extracted subcaption and the summary of the in-text reference; (c) Average retrieval performance of our model vs. other models (top), and retrieval results on MIMIC-CXR for model versions trained in different settings: subfigure only vs. subfigure + subcaption vs. subfigure + subcaption + summary (bottom). The compound image in (a) is originally from rivera2015matr.
  • Figure 2: (a) Overview of the Open-PMC-18M construction process and its key stages. (b) Subfigure extraction pipeline for creating the synthetic compound figures used to train the DAB-DETR model. (c) Example subfigures extracted by the DAB-DETR model from the ImageCLEF dataset.
  • Figure 3: Robustness evaluation across retrieval benchmarks. Robustness is quantified as the ratio of performance under visual perturbations (brightness, shift, rotation, flip, zoom) to performance on the original test set. Higher values indicate greater stability to perturbations.
  • Figure 4: Dataset composition and text length statistics. (a) Distribution of image modalities in Open-PMC-18M. (b) Token distribution for full captions. (c) Token distribution for subcaptions. (d) Token distribution for figure-context summaries.
  • Figure 5: Recall@200 of the trained model on the validation data over the last training epochs.
  • ...and 5 more figures
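
The robustness measure in the Figure 3 caption (performance under a visual perturbation divided by performance on the original test set) can be illustrated with a minimal sketch. The function name `robustness_ratios`, the per-perturbation dictionary layout, and the numbers in the usage example are hypothetical and not taken from the paper; the caption does not say how ratios are aggregated across perturbations, so the sketch keeps one ratio per perturbation.

```python
def robustness_ratios(original_score, perturbed_scores):
    """Return perturbed/original performance ratios, one per perturbation.

    original_score: retrieval metric (e.g. a recall value) on the clean test set.
    perturbed_scores: dict mapping a perturbation name to the same metric
        measured on the perturbed test set.
    Both the argument layout and the names are assumptions for illustration.
    """
    if original_score <= 0:
        raise ValueError("original_score must be positive")
    return {
        name: score / original_score
        for name, score in perturbed_scores.items()
    }


if __name__ == "__main__":
    # Hypothetical scores, not results reported in the paper.
    ratios = robustness_ratios(
        0.50,
        {"brightness": 0.45, "rotation": 0.40, "flip": 0.50},
    )
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f}")
```

A ratio of 1.0 means the perturbation left the metric unchanged, and lower values mean larger degradation, which matches the caption's reading that higher values indicate greater stability.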