BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

Alejandro Lozano; Min Woo Sun; James Burgess; Liangyu Chen; Jeffrey J Nirschl; Jeffrey Gu; Ivan Lopez; Josiah Aklilu; Austin Wolfgang Katzer; Collin Chiu; Anita Rau; Xiaohan Wang; Yuhui Zhang; Alfred Seunghoon Song; Robert Tibshirani; Serena Yeung-Levy

TL;DR

BIOMEDICA tackles the lack of open, richly annotated biomedical multimodal data by delivering a scalable framework to extract and annotate the entire PMC-OA corpus into a 24M image-caption dataset with 27 metadata fields. It enables streaming continual pretraining of CLIP-style models (BMCA-CLIP) on this dataset, achieving state-of-the-art zero-shot and retrieval performance across 40 biomedical tasks with substantially reduced compute and data requirements. The work introduces a clinician-guided taxonomy and an AI-assisted clustering pipeline to label images at scale, and it provides a reproducible pipeline, dataset, and models via public hosting and streaming capabilities. Overall, BIOMEDICA offers a practical foundation for advancing vision-language modeling in diverse biomedical domains and fosters reproducibility and collaboration in biomedical AI research.

Abstract

The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA, a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are also provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally. On average, our models achieve state-of-the-art performance across 40 tasks - spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology - excelling in zero-shot classification with a 6.56% average improvement (as high as 29.8% and 17.5% in dermatology and ophthalmology, respectively), and stronger image-text retrieval, all while using 10x less compute. To foster reproducibility and collaboration, we release our codebase and dataset for the broader research community.
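The zero-shot classification setting mentioned above can be illustrated with a minimal sketch. It assumes a CLIP-style model has already produced one embedding for the image and one embedding per candidate text prompt; the toy vectors, the prompt strings, the `temperature` value, and the function names are all hypothetical and are not taken from the BMCA-CLIP release.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, text_embs, labels, temperature=0.01):
    """Pick the label whose prompt embedding is most similar to the image.

    `temperature` is an assumed scaling constant, not a value from the paper.
    """
    sims = [cosine(image_emb, t) for t in text_embs]
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy stand-ins for encoder outputs (hypothetical values).
labels = ["histopathology slide", "chest X-ray", "fundus photograph"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.1, 0.9, 0.2]

predicted, probs = zero_shot_classify(image_emb, text_embs, labels)
print(predicted)
```

No task-specific training happens in this step: adding a class only requires embedding one more text prompt, which is why the same model can be scored across many tasks.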

Paper Structure (42 sections, 7 equations, 10 figures, 20 tables)

Introduction
Related Work
BIOMEDICA Data Curation Process
Data Extraction
Concept Labeling
Data Serialization
BIOMEDICA Dataset Description
Evaluation Benchmark
Experiments
Results
Limitations
Conclusion
Acknowledgments
Dataset Description
Dataset Statistics
...and 27 more sections

Figures (10)

Figure 1: Overlap of the BIOMEDICA dataset with the Landscape of Biomedical Research [gonzalez2024landscape]. Each color region reflects thematic concentrations, capturing the diversity of topics within our dataset. Gray points represent articles not present in BIOMEDICA.
Figure 2: BIOMEDICA curation pipeline: In the Extract phase, metadata, text (caption, figure reference, full-text), and images are sourced and processed from PMC-OA. In the Transform phase, DINO v2 features are generated for each image, followed by clustering using PCA and k-means. Clinicians and scientists annotate these clusters, identifying 12 global concepts and 170 local concepts, which are then propagated across all images. Finally, in the Load phase, the dataset is made available on Hugging Face with the listed features.
Figure 3: Left: Examples of images included in the BIOMEDICA dataset, ranging from clinical imaging to maps and bar plots. The word cloud reflects the fine-grained local concept proportions for the most frequent concepts in the BIOMEDICA dataset. Right: Visualization of the concept breakdown in the BIOMEDICA taxonomy. The left pie chart reflects the panel type (light blue indicates single-panel images, dark blue indicates multi-panel images) and the pie chart on the right shows the global concept of individual taxonomies.
Figure 4: Average model performance across different biomedical domains of the best BMCA-CLIP models compared to prior work.
Figure S1: BIOMEDICA cohort diagram: selection criteria for the construction of relevant image-caption pairs.
...and 5 more figures
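
The Transform phase described in the Figure 2 caption (cluster image features, have experts annotate each cluster, then propagate the cluster's concepts to every image in it) can be sketched as follows. This is a simplified illustration, not the authors' pipeline: DINOv2 feature extraction and PCA are omitted, the two-dimensional feature vectors are toy values, k-means is initialized with the first `k` points for determinism, and the concept names in `cluster_annotations` are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        assignments = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    # Recompute assignments against the final centroids.
    return [nearest(p, centroids) for p in points], centroids


# Toy image features (stand-ins for reduced image embeddings).
features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]

assignments, centroids = kmeans(features, k=2)

# Experts label each cluster once: (global concept, local concept).
# These names are illustrative assumptions only.
cluster_annotations = {
    0: ("Microscopy", "light microscopy"),
    1: ("Plots and Charts", "bar plot"),
}

# Propagate the cluster-level annotation to every image in the cluster.
image_labels = [cluster_annotations[a] for a in assignments]
print(image_labels)
```

Annotating clusters rather than individual images is what makes the labeling tractable at this scale: the expert effort grows with the number of clusters, not with the number of images.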
