From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature

Kun Yuan; Min Woo Sun; Zhen Chen; Alejandro Lozano; Xiangteng He; Shi Li; Nassir Navab; Xiaoxiao Sun; Nicolas Padoy; Serena Yeung-Levy

TL;DR

Panel2Patch introduces a scalable data-generation pipeline that automatically mines hierarchical supervision (figure, panel, region) from biomedical literature, paired with a hierarchical zoom-in pretraining objective. By leveraging Set-of-Markers prompts and region-grounded captions via LVLMs, it achieves multi-granularity vision-language alignment with data efficiency, outperforming prior models trained on much larger corpora. The method demonstrates strong panel- and region-level retrieval and grounding, data-efficient zero-shot classification across specialties, and robust grounding in cell-imaging contexts. Overall, exploiting the structured pedagogical layout of scientific figures enables substantial gains without extra annotation, offering a practical path to biomedical foundation models.

Abstract

There is growing interest in developing strong biomedical vision-language models. A popular approach to learning robust representations is to pretrain on web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts it into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchically aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives, from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small subset of literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.
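
To make the three levels of supervision concrete, the following is a minimal sketch of how one parsed figure might be represented and flattened into image-text pairs. It is an illustration only: the class names (`Figure`, `Panel`, `Patch`), their fields, and the `to_pairs` helper are hypothetical and are not taken from the Panel2Patch implementation, and the layout parsing and marker detection that would populate these objects are assumed to have already run.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical containers: names and fields are illustrative assumptions,
# not the schema used by the paper.
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in figure pixel coordinates


@dataclass
class Patch:
    box: Box      # region indicated by a visual marker (arrow, label, ...)
    phrase: str   # fine, region-focused phrase


@dataclass
class Panel:
    box: Box      # sub-figure extent inside the full figure
    caption: str  # panel-level caption text
    patches: List[Patch] = field(default_factory=list)


@dataclass
class Figure:
    figure_id: str
    size: Tuple[int, int]  # (width, height) of the full figure
    caption: str           # coarse, figure-level caption
    panels: List[Panel] = field(default_factory=list)


def to_pairs(fig: Figure) -> List[Tuple[str, Box, str]]:
    """Flatten one figure into (level, box, text) pairs at all three levels."""
    width, height = fig.size
    pairs = [("figure", (0, 0, width, height), fig.caption)]
    for panel in fig.panels:
        pairs.append(("panel", panel.box, panel.caption))
        for patch in panel.patches:
            pairs.append(("patch", patch.box, patch.phrase))
    return pairs
```

Under these assumptions, a figure with two panels and one marked region yields four pairs rather than one, which is the sense in which a single figure provides more supervision than a single figure-level sample.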
