A Clinical Benchmark of Public Self-Supervised Pathology Foundation Models

Gabriele Campanella, Shengjia Chen, Ruchika Verma, Jennifer Zeng, Aryeh Stock, Matt Croken, Brandon Veremis, Abdulkadir Elmas, Kuan-lin Huang, Ricky Kwan, Jane Houldsworth, Adam J. Schoenfeld, Chad Vanderbilt

TL;DR

The paper addresses the need to systematically benchmark publicly available pathology foundation models trained with self-supervised learning. It constructs a clinical benchmark from two institutions spanning disease detection, biomarker prediction, and treatment-outcome endpoints, and evaluates tile-level encoders via a slide-level Gated MIL aggregation. Key findings show that modern DINO and DINOv2-based models generally outperform ImageNet baselines, with UNI and Prov-GigaPath achieving strong performance on several biomarkers. Larger model size does not guarantee gains on detection tasks, and gains on biomarker tasks depend on pretraining data composition. The work provides a practical framework for model comparison, highlights the importance of training data composition, and outlines directions for improving benchmarks and enabling public, iterative evaluation of pathology foundation models.
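
To make the slide-level aggregation step concrete, the sketch below shows gated attention pooling as it is commonly formulated for multiple instance learning: each tile embedding receives a score from a tanh branch multiplied by a sigmoid gate, the scores are softmax-normalized across tiles, and the slide embedding is the attention-weighted sum of tile embeddings. This is a minimal illustration, not the paper's implementation. The function name `gated_attention_pool`, the parameters `V`, `U`, and `w`, the dimensions, and the random inputs are all assumptions made for the example; in practice the parameters would be learned jointly with a slide-level classifier.

```python
import math
import random


def gated_attention_pool(tiles, V, U, w):
    """Pool tile embeddings into one slide embedding with gated attention.

    tiles: list of K tile embeddings, each a list of D floats.
    V, U:  hypothetical projection matrices, each with L rows of D floats.
    w:     hypothetical attention vector of L floats.
    Returns (slide_embedding, attention_weights).
    """
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    # One unnormalized score per tile: w . (tanh(V h) * sigmoid(U h)).
    scores = []
    for h in tiles:
        tanh_branch = [math.tanh(v) for v in matvec(V, h)]
        gate_branch = [1.0 / (1.0 + math.exp(-u)) for u in matvec(U, h)]
        scores.append(
            sum(wl * t * g for wl, t, g in zip(w, tanh_branch, gate_branch))
        )

    # Softmax over tiles, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]

    # Slide embedding is the attention-weighted sum of tile embeddings.
    dim = len(tiles[0])
    slide = [sum(a * h[d] for a, h in zip(attn, tiles)) for d in range(dim)]
    return slide, attn


# Toy usage with random stand-ins for frozen-encoder tile embeddings.
rng = random.Random(0)
K, D, L = 5, 4, 3  # tiles per slide, embedding size, attention hidden size
tiles = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
V = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(L)]
U = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(L)]
w = [rng.gauss(0, 1) for _ in range(L)]

slide, attn = gated_attention_pool(tiles, V, U, w)
```

Because the attention weights are non-negative and sum to one, the slide embedding is a convex combination of the tile embeddings, and the weights themselves can be inspected to see which tiles drove a slide-level prediction.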

Abstract

The use of self-supervised learning (SSL) to train pathology foundation models has increased substantially in the past few years. Notably, several models trained on large quantities of clinical data have been made publicly available in recent months. This will significantly enhance scientific research in computational pathology and help bridge the gap between research and clinical deployment. With the increase in availability of public foundation models of different sizes, trained using different algorithms on different datasets, it becomes important to establish a benchmark to compare the performance of such models on a variety of clinically relevant tasks spanning multiple organs and diseases. In this work, we present a collection of pathology datasets comprising clinical slides associated with clinically relevant endpoints including cancer diagnoses and a variety of biomarkers generated during standard hospital operation from two medical centers. We leverage these datasets to systematically assess the performance of public pathology foundation models and provide insights into best practices for training new foundation models and selecting appropriate pretrained models.

Paper Structure (12 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Benchmarking Results: Detection Tasks.
  • Figure 2: Benchmarking Results: Biomarker Prediction Tasks.
  • Figure 3: Scaling Laws: downstream performance vs foundation model size.
  • Figure 4: Scaling Laws: downstream performance vs computational resources used for pretraining the foundation models.