A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI

Alejandro Lozano, Min Woo Sun, James Burgess, Jeffrey J. Nirschl, Christopher Polzak, Yuhui Zhang, Liangyu Chen, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Anita Rau, Austin Wolfgang Katzer, Collin Chiu, Orr Zohar, Xiaohan Wang, Alfred Seunghoon Song, Chiang Chia-Chun, Robert Tibshirani, Serena Yeung-Levy

TL;DR

This work introduces Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations).

Abstract

Despite the excitement behind biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data - the foundation for modern AI systems - is still a bottleneck to unlocking its full potential. To address this gap, we introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset, containing over 6 million scientific articles and 24 million image-text pairs, along with 27 metadata fields (including expert human annotations). To overcome the challenges of accessing our large-scale dataset, we provide scalable streaming and search APIs through a web server, facilitating seamless integration with AI systems. We demonstrate the utility of the Biomedica dataset by building embedding models, chat-style models, and retrieval-augmented chat agents. Notably, all our AI models surpass previous open systems in their respective categories, underscoring the critical role of diverse, high-quality, and large-scale biomedical data.

Paper Structure

This paper contains 21 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview of the BIOMEDICA dataset, tools for accessibility, and its applications. (A) Overlap of BIOMEDICA with the Landscape of Biomedical Research (gonzalez2024landscape): The dataset comprises 6 million open-access articles, 24 million image-caption pairs, and 30 million in-line references, spanning diverse biomedical image types, including clinical radiology and pathology images, research microscopy, immunoassays, and chemical structures. (B) To facilitate AI model development and inference, we offer data streaming, filtering, and the BIOMEDICA Index. Streaming enables efficient training without the need for extensive local storage. Data filtering allows users to create domain-specific subsets of the data. The BIOMEDICA Index supports multi-modal retrieval-based applications. (C) The BIOMEDICA dataset enables diverse biomedical applications, including chat models, embedding models, and agentic systems.
  • Figure 2: Overview of the BIOMEDICA dataset statistics and annotations: (A) List of metadata fields provided in the dataset, along with their respective sources of provenance. (B) Summary statistics for the dataset's text tokens, characters, and image dimensions. Image statistics include width, height, and area in pixels. (C) Distribution of image annotations in the dataset. The word cloud on the left visualizes the most common concept labels assigned during annotation. The left pie chart shows the proportion of images categorized as single-panel or multi-panel. The right pie chart presents the distribution of global image classes, representing various biomedical categories.
  • Figure 3: BIOMEDICA enables state-of-the-art performance across multiple applications. (A) Multimodal embedding model performance on biomedical image classification tasks. (B) Autoregressive model performance on biomedical VQA tasks (results for previous models are taken from zhang2023huatuogpt). (C) Autoregressive model performance on biomedical guidelines QA across four LLMs, with and without retrieval augmentation using the BIOMEDICA Index.
  • Figure 4: Query time as a function of the number of words (tokens) in the BIOMEDICA Index. The blue solid line represents the mean query time, while the shaded blue region indicates the standard error. The dashed gray line denotes the linear trend in query time as token count increases ($R = 0.902$).
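
The linear trend and correlation coefficient $R$ reported in the Figure 4 caption can be illustrated with a minimal sketch. It fits an ordinary least-squares line and computes the Pearson correlation in plain Python. The token counts and timings below are synthetic values invented for illustration; they are not the authors' measurements, and the function name `linear_trend` is hypothetical.

```python
import math


def linear_trend(xs, ys):
    """Return (slope, intercept, pearson_r) for paired samples xs, ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # least-squares slope
    intercept = mean_y - slope * mean_x    # line passes through the means
    pearson_r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, pearson_r


# Synthetic example (NOT data from the paper): query length in tokens and
# a made-up query time that grows roughly linearly, plus small deviations.
token_counts = [1, 2, 4, 8, 16, 32]
deviations = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0]
query_times = [10.0 + 0.5 * t + d for t, d in zip(token_counts, deviations)]

slope, intercept, r = linear_trend(token_counts, query_times)
print(f"slope={slope:.3f} intercept={intercept:.3f} R={r:.3f}")
```

A value of $R$ close to 1 indicates that the timings lie near the fitted line, which is how the caption's dashed trend line and its $R$ value should be read.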