PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology

Fengchun Liu, Songhan Jiang, Linghan Cai, Ziyue Wang, Yongbing Zhang

TL;DR

PathFLIP tackles the challenge of fine-grained multimodal understanding in gigapixel whole slide images by introducing region-level language–image pretraining. It decomposes slide captions into region-level subcaptions and learns text-conditioned region embeddings via Region and Slide Q-Formers, complemented by global slide-caption alignment. By integrating with large language models, PathFLIP achieves instruction-following, captioning, VQA, and robust zero-shot classification and retrieval, while delivering accurate visual grounding without region annotations. The approach demonstrates superior performance across multiple pathology benchmarks with substantially less training data, offering a practical, interpretable path toward clinical AI-assisted diagnosis.

Abstract

While Vision-Language Models (VLMs) have achieved notable progress in computational pathology (CPath), the gigapixel scale and spatial heterogeneity of Whole Slide Images (WSIs) continue to pose challenges for multimodal understanding. Existing alignment methods struggle to capture fine-grained correspondences between textual descriptions and visual cues across thousands of patches from a slide, compromising their performance on downstream tasks. In this paper, we propose PathFLIP (Pathology Fine-grained Language-Image Pretraining), a novel framework for holistic WSI interpretation. PathFLIP decomposes slide-level captions into region-level subcaptions and generates text-conditioned region embeddings to facilitate precise visual-language grounding. By harnessing Large Language Models (LLMs), PathFLIP can seamlessly follow diverse clinical instructions and adapt to varied diagnostic contexts. Furthermore, it exhibits versatile capabilities across multiple paradigms, efficiently handling slide-level classification and retrieval, fine-grained lesion localization, and instruction following. Extensive experiments demonstrate that PathFLIP outperforms existing large-scale pathological VLMs on four representative benchmarks while requiring significantly less training data, paving the way for fine-grained, instruction-aware WSI interpretation in clinical practice.

Paper Structure

This paper contains 28 sections, 11 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of PathFLIP with previous methods in vision-language pathology modeling. (a) CLIP-based methods perform coarse global feature alignment between images and text. (b) Context-conditioned approaches use textual cues to guide attention but focus on global alignment. (c) Our PathFLIP enables fine-grained pathology analysis by aligning localized image embeddings with semantically matched text segments.
  • Figure 2: Overview of PathFLIP. Given a slide-caption pair $\langle S^i, T^i \rangle$, the slide $S^i$ is divided into $N$ regions $\{S^{i}_{1}, \ldots, S^{i}_{N}\}$. A Slide Q-Former and a Region Q-Former extract slide-level and region-level features, respectively. Captions $\{T^k, T^j, T^i\}$ are decomposed and sampled to obtain region-level subcaptions $\{T^{k}_1, T^{j}_2, T^{i}_3\}$. The slide-level contrastive loss $\mathcal{L}_{slide}$ aligns the global image feature $v^i$ with its corresponding text feature $t^i$. The region-level contrastive loss $\mathcal{L}_{region}$ treats region-image and subcaption pairs from the same slide as positives and all other pairs in the batch as negatives.
  • Figure 3: PathFLIP serves as a versatile tool in computational pathology. It accommodates a diverse range of multimodal pathology tasks at both slide and region levels. In (b), "$S_i$" refers to the similarity between the $i$-th slide and the input text.
  • Figure 4: Visual grounding results. High-attention areas are highlighted in red in the heatmap, with red boxes marking the corresponding regions.
  • Figure 5: Caption generation comparison. Blue indicates correct matches, red indicates incorrect or imprecise matches, and yellow backgrounds emphasize important information matches.
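
The two contrastive objectives named in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a CLIP-style symmetric InfoNCE loss, one sampled region/subcaption pair per slide matched by row index, a temperature of 0.07, and an unweighted sum of the two losses. The function names, feature dimensions, and random features are hypothetical placeholders for the Q-Former and text-encoder outputs.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE: row i of image_feats is the positive for row i of
    text_feats; every other row in the batch acts as a negative.
    The temperature value is an assumption, not taken from the paper."""
    v = l2_normalize(image_feats)
    t = l2_normalize(text_feats)
    logits = v @ t.T / temperature            # (B, B) similarity matrix
    idx = np.arange(len(v))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Hypothetical batch of 3 slides with 16-dimensional features.
rng = np.random.default_rng(0)
slide_feats = rng.normal(size=(3, 16))     # stands in for v^i (slide level)
caption_feats = rng.normal(size=(3, 16))   # stands in for t^i (slide captions)
region_feats = rng.normal(size=(3, 16))    # one sampled region embedding per slide
subcaption_feats = rng.normal(size=(3, 16))  # its matching subcaption feature

loss_slide = symmetric_contrastive_loss(slide_feats, caption_feats)
loss_region = symmetric_contrastive_loss(region_feats, subcaption_feats)
total_loss = loss_slide + loss_region      # equal weighting is an assumption
print(f"L_slide={loss_slide:.3f}  L_region={loss_region:.3f}  total={total_loss:.3f}")
```

The same loss function serves both levels; only the inputs change. How PathFLIP actually samples subcaptions, conditions region embeddings on text, and weights the two terms is specified in the paper's method section rather than in this summary.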