To What Extent Do Token-Level Representations from Pathology Foundation Models Improve Dense Prediction?

Weiming Chen, Xitong Ling, Xidong Wang, Zhenyang Cai, Yijia Guo, Mingxi Fu, Ziyi Zeng, Minxi Ouyang, Jiawen Li, Yizhi Wang, Tian Guan, Benyou Wang, Yonghong He

TL;DR

This work examines how token-level representations from pathology foundation models (PFMs) translate to dense segmentation, and finds that performance is not simply a function of larger models or more adaptation parameters. Through PFM-DenseBench, the authors evaluate 17 PFMs across 18 segmentation datasets with multiple fine-tuning strategies, finding that locality biases (e.g., CNN adapters) and appropriate input granularity (around $1024^2$) are pivotal for pixel-precise predictions. They show that scaling laws common in other domains do not consistently apply to dense pathology tasks, and they use token-similarity analyses to explain why higher-capacity adapters saturate. The results offer practical guidance for selecting and adapting PFMs for dense pathology tasks and argue for backbone designs focused on dense representations, so that complex tissue morphologies can be quantified reliably. Reported metrics include mIoU, Dice, PixelAccuracy, FWIoU, Precision, Recall, and F1 across datasets and models.
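
As a rough illustration of the overlap metrics named above, the following minimal sketch computes mIoU, Dice, PixelAccuracy, and FWIoU from a confusion matrix. The function names and the averaging conventions (macro-average over classes, no special handling of background, classes absent from both prediction and target skipped) are assumptions made for this example; the benchmark's exact protocol is not specified in this summary.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows index the true class, columns the predicted class."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    flat_index = target * num_classes + pred
    counts = np.bincount(flat_index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(pred, target, num_classes):
    """Hypothetical helper: macro-averaged overlap metrics (assumed convention)."""
    cm = confusion_matrix(pred, target, num_classes).astype(float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as class c, but true class differs
    fn = cm.sum(axis=1) - tp   # true class c, but predicted otherwise
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / (tp + fp + fn)            # NaN for classes absent everywhere
        dice = 2 * tp / (2 * tp + fp + fn)
    freq = cm.sum(axis=1) / cm.sum()         # share of pixels per true class
    present = freq > 0
    return {
        "mIoU": float(np.nanmean(iou)),
        "Dice": float(np.nanmean(dice)),
        "PixelAccuracy": float(tp.sum() / cm.sum()),
        "FWIoU": float((freq[present] * iou[present]).sum()),
    }

# Toy 2x2 masks with two classes.
pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
metrics = segmentation_metrics(pred, target, num_classes=2)
```

On the toy masks, class 0 has IoU 1/2 and class 1 has IoU 2/3, so mIoU is 7/12; FWIoU weights each class IoU by its pixel frequency in the ground truth.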

Abstract

Pathology foundation models (PFMs) have rapidly advanced and are becoming a common backbone for downstream clinical tasks, offering strong transferability across tissues and institutions. However, for dense prediction (e.g., segmentation), practical deployment still lacks a clear, reproducible understanding of how different PFMs behave across datasets and how adaptation choices affect performance and stability. We present PFM-DenseBench, a large-scale benchmark for dense pathology prediction, evaluating 17 PFMs across 18 public segmentation datasets. Under a unified protocol, we systematically assess PFMs with multiple adaptation and fine-tuning strategies, and derive insightful, practice-oriented findings on when and why different PFMs and tuning choices succeed or fail across heterogeneous datasets. We release containers, configs, and dataset cards to enable reproducible evaluation and informed PFM selection for real-world dense pathology tasks. Project Website: https://m4a1tastegood.github.io/PFM-DenseBench

Paper Structure (29 sections, 8 figures, 93 tables)

Figures (8)

  • Figure 1: Overview of PFM-DenseBench: A unified benchmark for evaluating Pathology Foundation Models on dense prediction. The framework comprises three stages: (1) Dataset Curation: 18 public datasets covering nuclei-, gland-, and tissue-level segmentation across multiple organs. (2) Model and Strategy Evaluation: 17 vision-only and vision-language PFMs evaluated under five adaptation strategies, including LoRA/DoRA and CNN/Transformer adapters. (3) Benchmark Validation: A standardized protocol with task-specific fine-tuning, multi-metric evaluation, and qualitative analysis to characterize transfer effectiveness and scaling behavior.
  • Figure 2: Schematic illustration of parameter-efficient adaptation strategies for dense prediction with frozen Pathology Foundation Models. (A) Low-Rank Adaptation (LoRA/DoRA): Trainable low-rank dynamic weights (purple) are injected into the Query (Q), Key (K), Value (V), and Output (O) projections of the frozen self-attention layers, decoupling optimization from the massive encoder parameters. (B) CNN Adapter: A parallel ResNetV2-style CNN branch extracts multi-scale local features alongside the frozen PFM. These features are injected into the decoder via skip connections to recover fine-grained spatial details and enhance boundary delineation. (C) Transformer Adapter: A sequential adaptation module that appends trainable Transformer blocks to the frozen encoder. This strategy processes the full token sequence to refine global semantic representations specifically for the downstream segmentation task.
  • Figure 3: Benchmarking Pathology Foundation Models for Dense Prediction: Segmentation Performance Across Datasets and Fine-tuning Strategies. Each box aggregates the evaluation results across 17 PFMs for a given fine-tuning strategy.
  • Figure 4: Scaling behavior of pathology foundation models for dense prediction under the frozen-encoder setting.
  • Figure 5: LoRA rank ablation across representative pathology segmentation regimes with 95% bootstrap confidence intervals. We evaluate LoRA fine-tuning with varying ranks and full fine-tuning on three representative datasets spanning cell (TNBC), gland (GlaS), and tissue (COSAS24) segmentation.
  • ...and 3 more figures
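
The low-rank adaptation described in the Figure 2(A) caption can be sketched for a single frozen projection as follows. This is a minimal NumPy illustration of the standard LoRA update $W' = W + \frac{\alpha}{r} B A$, not the benchmark's implementation: the class name `LoRALinear`, the `alpha` scaling, and the initialization values are assumptions for this example, and DoRA's additional magnitude/direction decomposition is not shown.

```python
import numpy as np

class LoRALinear:
    """Hypothetical frozen linear projection with a trainable low-rank update."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen; never updated
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, rank))                    # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        frozen_out = x @ self.weight.T
        low_rank_out = (x @ self.A.T) @ self.B.T            # rank-r path
        return frozen_out + self.scale * low_rank_out

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W_q = rng.normal(size=(8, 8))       # stand-in for a frozen query projection
tokens = rng.normal(size=(3, 8))    # three tokens of width 8
layer = LoRALinear(W_q, rank=2, alpha=2.0)

# With B initialized to zero, the adapted layer reproduces the frozen output.
out_at_init = layer(tokens)
```

Because `B` starts at zero, training begins from the frozen model's behavior, and only `A` and `B` (here 32 values versus 64 in `W_q`) would receive gradient updates. In the caption's setup the same construction would apply to each of the Q, K, V, and O projections.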