From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature

Kun Yuan, Min Woo Sun, Zhen Chen, Alejandro Lozano, Xiangteng He, Shi Li, Nassir Navab, Xiaoxiao Sun, Nicolas Padoy, Serena Yeung-Levy

TL;DR

Panel2Patch introduces a scalable data-generation pipeline that automatically mines hierarchical supervision (figure, panel, region) from biomedical literature, paired with a hierarchical zoom-in pretraining objective. Using Set-of-Mark (SoM) prompts and region-grounded captions generated by LVLMs, it achieves data-efficient multi-granularity vision-language alignment and outperforms prior models trained on much larger corpora. The method demonstrates strong panel- and region-level retrieval and grounding, data-efficient zero-shot classification across specialties, and robust grounding in cell-imaging contexts. Overall, exploiting the structured pedagogical layout of scientific figures enables substantial gains without extra annotation, offering a practical path to biomedical foundation models.

Abstract

There is growing interest in developing strong biomedical vision-language models. A popular approach to achieving robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts it into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchically aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small subset of literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.
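
To make the figure, panel, and patch levels concrete, the following is a minimal sketch of how one parsed figure could be flattened into multi-granular image-text pairs. It is an illustration only, not the authors' implementation: the class names (`Figure`, `Panel`, `Region`), the `build_pairs` helper, the box convention, and the example captions are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed convention: (x0, y0, x1, y1) in pixel coordinates of the full figure.
Box = Tuple[int, int, int, int]


@dataclass
class Region:
    box: Box
    phrase: str  # short region-focused phrase


@dataclass
class Panel:
    box: Box
    caption: str  # panel-level description
    regions: List[Region] = field(default_factory=list)


@dataclass
class Figure:
    size: Tuple[int, int]  # (width, height)
    caption: str  # full figure caption
    panels: List[Panel] = field(default_factory=list)


def build_pairs(fig: Figure) -> List[Tuple[str, Box, str]]:
    """Flatten one figure into (level, crop_box, text) pairs.

    The figure itself contributes one coarse pair, each panel one
    medium-grained pair, and each marked region one fine-grained pair.
    """
    width, height = fig.size
    pairs: List[Tuple[str, Box, str]] = [
        ("figure", (0, 0, width, height), fig.caption)
    ]
    for panel in fig.panels:
        pairs.append(("panel", panel.box, panel.caption))
        for region in panel.regions:
            pairs.append(("patch", region.box, region.phrase))
    return pairs


# Made-up example: a two-panel figure with one marked region in panel A.
example = Figure(
    size=(800, 400),
    caption="Histology (A) and CT (B) of the same lesion.",
    panels=[
        Panel(
            box=(0, 0, 400, 400),
            caption="(A) H&E-stained section.",
            regions=[Region(box=(120, 80, 200, 160), phrase="region marked by arrow")],
        ),
        Panel(box=(400, 0, 800, 400), caption="(B) Axial CT slice."),
    ],
)
pairs = build_pairs(example)
for level, box, text in pairs:
    print(level, box, text)
```

In this sketch a single figure yields four pairs instead of one, which is the sense in which the hierarchy provides more supervision per figure than figure-level pairing alone.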

Paper Structure

This paper contains 41 sections, 6 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Our Panel2Patch pipeline generates additional fine-grained vision-language supervision signals and enhances the multi-modal representation through cross-level message passing.
  • Figure 2: Overview of our Panel2Patch pipeline (I, II) and hierarchical pretraining (III). (I) Multi-panel figures are split into single panels and associated with their captions. (II) Complementary LVLM-based agents accurately detect biomedical regions of interest. (III) Hierarchical pretraining learns from multi-level vision–language correspondences and enhances the embedding space across panel-, patch-, and region-level representations.
  • Figure 3: Qualitative examples of fine-grained retrieval using bounding box-cropped images and region-level texts.
  • Figure 4: Prompt for SoM-guided panel decomposition. We show an example prompt and LVLM response for decomposing a multi-panel biomedical figure into individual panels, each with a panel ID, bounding box, and short description. The prompt enforces a strict JSON schema, which facilitates reliable downstream parsing.
  • Figure 5: Prompts for marker-guided region mining. Example prompt and LVLM output for detecting visual markers (arrows, stars, etc.) and producing bounding boxes for the corresponding target regions. Each entry includes a marker ID, target box, short label, and a one-sentence description.
  • ...and 5 more figures
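
The caption of Figure 4 notes that a strict JSON schema makes the LVLM response easier to parse downstream. A minimal sketch of such a parser is given below, assuming a response of the form `{"panels": [{"panel_id", "bbox", "description"}, ...]}`. The field names, the `[x0, y0, x1, y1]` box layout, and the `parse_panels` helper are hypothetical; the paper's actual schema is shown only in its Figure 4 and may differ.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical field names; the real schema is defined by the paper's prompt.
REQUIRED_KEYS = {"panel_id", "bbox", "description"}


@dataclass
class PanelRecord:
    panel_id: str
    bbox: Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)
    description: str


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_panels(raw: str) -> List[PanelRecord]:
    """Parse a panel-decomposition response; raise ValueError if it breaks the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("panels"), list):
        raise ValueError("expected a JSON object with a 'panels' list")

    records: List[PanelRecord] = []
    for item in data["panels"]:
        if not isinstance(item, dict):
            raise ValueError("each panel entry must be a JSON object")
        missing = REQUIRED_KEYS - set(item)
        if missing:
            raise ValueError(f"panel entry is missing keys: {sorted(missing)}")
        bbox = item["bbox"]
        if not (isinstance(bbox, list) and len(bbox) == 4 and all(_is_number(v) for v in bbox)):
            raise ValueError("bbox must be a list of four numbers")
        x0, y0, x1, y1 = bbox
        if x0 >= x1 or y0 >= y1:
            raise ValueError("bbox must have positive width and height")
        records.append(PanelRecord(str(item["panel_id"]), (x0, y0, x1, y1), str(item["description"])))
    return records


# Made-up response used only to exercise the parser.
raw_response = """
{"panels": [
  {"panel_id": "A", "bbox": [0, 0, 400, 400], "description": "stained tissue section"},
  {"panel_id": "B", "bbox": [400, 0, 800, 400], "description": "axial CT slice"}
]}
"""
records = parse_panels(raw_response)
print([r.panel_id for r in records])
```

Rejecting malformed responses early, rather than repairing them, keeps bad boxes out of the pretraining corpus; a pipeline could simply re-query the model when `ValueError` is raised. The marker-level output described for Figure 5 (marker ID, target box, label, description) could be validated in the same way with a different key set.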