To What Extent Do Token-Level Representations from Pathology Foundation Models Improve Dense Prediction?
Weiming Chen, Xitong Ling, Xidong Wang, Zhenyang Cai, Yijia Guo, Mingxi Fu, Ziyi Zeng, Minxi Ouyang, Jiawen Li, Yizhi Wang, Tian Guan, Benyou Wang, Yonghong He
TL;DR
This work examines how token-level representations from pathology foundation models (PFMs) transfer to dense segmentation, and finds that performance is not simply a function of larger models or more adaptation parameters. Using PFM-DenseBench, the authors evaluate 17 PFMs on 18 segmentation datasets under multiple fine-tuning strategies. Two factors prove pivotal for pixel-precise prediction: locality biases (e.g., CNN adapters) and an appropriate input granularity (around 1024×1024). Scaling laws common in other domains do not consistently hold for dense pathology tasks, and token-similarity analyses offer a mechanistic explanation of why higher-capacity adapters saturate. The results give practical guidance for selecting and adapting PFMs for dense pathology tasks, and the authors advocate backbone designs focused on dense representations so that complex tissue morphologies can be quantified reliably. Metrics, including mIoU, Dice, pixel accuracy, frequency-weighted IoU (FWIoU), precision, recall, and F1, are given in formal notation where applicable and reported across datasets and models.
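For readers unfamiliar with the segmentation metrics listed above, the following is a minimal sketch of how they are commonly computed from a pixel-level confusion matrix. It uses the standard textbook definitions, not necessarily the exact conventions of PFM-DenseBench: the function names, the choice to skip classes absent from both ground truth and prediction, and the macro-averaging over classes are all assumptions made for illustration.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels; rows are ground-truth classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def segmentation_metrics(cm):
    """Standard dense-prediction metrics from a non-empty confusion matrix.

    Assumption (not from the paper): classes that appear in neither the
    ground truth nor the prediction are excluded from the class averages.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = [cm[k][k] for k in range(n)]
    gt = [sum(cm[k]) for k in range(n)]                         # TP + FN per class
    pred = [sum(cm[r][k] for r in range(n)) for k in range(n)]  # TP + FP per class
    present = [k for k in range(n) if gt[k] + pred[k] > 0]

    iou = {k: tp[k] / (gt[k] + pred[k] - tp[k]) for k in present}
    # Per-class Dice is identical to per-class F1: 2TP / (2TP + FP + FN).
    dice = {k: 2 * tp[k] / (gt[k] + pred[k]) for k in present}
    precision = {k: tp[k] / pred[k] if pred[k] else 0.0 for k in present}
    recall = {k: tp[k] / gt[k] if gt[k] else 0.0 for k in present}

    def mean(values):
        return sum(values.values()) / len(values)

    return {
        "pixel_accuracy": sum(tp) / total,
        "miou": mean(iou),
        "mean_dice": mean(dice),
        # FWIoU weights each class IoU by its share of ground-truth pixels.
        "fwiou": sum(gt[k] / total * iou[k] for k in present),
        "macro_precision": mean(precision),
        "macro_recall": mean(recall),
    }


if __name__ == "__main__":
    # Toy example: four pixels, two classes (flattened label maps).
    cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
    for name, value in segmentation_metrics(cm).items():
        print(f"{name}: {value:.4f}")
```

In practice the label maps would be flattened arrays from whole images or tiles, and a benchmark may accumulate the confusion matrix per image or per dataset before averaging; that choice changes the reported numbers and is not specified in this summary.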
Abstract
Pathology foundation models (PFMs) have rapidly advanced and are becoming a common backbone for downstream clinical tasks, offering strong transferability across tissues and institutions. However, for dense prediction (e.g., segmentation), practical deployment still lacks a clear, reproducible understanding of how different PFMs behave across datasets and how adaptation choices affect performance and stability. We present PFM-DenseBench, a large-scale benchmark for dense pathology prediction, evaluating 17 PFMs across 18 public segmentation datasets. Under a unified protocol, we systematically assess PFMs with multiple adaptation and fine-tuning strategies, and derive insightful, practice-oriented findings on when and why different PFMs and tuning choices succeed or fail across heterogeneous datasets. We release containers, configs, and dataset cards to enable reproducible evaluation and informed PFM selection for real-world dense pathology tasks. Project Website: https://m4a1tastegood.github.io/PFM-DenseBench
