PathBench: A comprehensive comparison benchmark for pathology foundation models towards precision oncology

Jiabo Ma, Yingxue Xu, Fengtao Zhou, Yihui Wang, Cheng Jin, Zhengrui Guo, Jianfeng Wu, On Ki Tang, Huajun Zhou, Xi Wang, Luyang Luo, Zhengyu Zhang, Du Cai, Zizhao Gao, Wei Wang, Yueping Liu, Jiankun He, Jing Cui, Zhenhui Li, Jing Zhang, Feng Gao, Xiuming Zhang, Li Liang, Ronald Cheong Kin Chan, Zhe Wang, Hao Chen

TL;DR

This work provides researchers with a robust platform for model development and offers clinicians actionable insights into PFM performance across diverse clinical scenarios, ultimately accelerating the translation of these transformative technologies into routine pathology practice.

Abstract

The emergence of pathology foundation models (PFMs) has revolutionized computational histopathology, enabling highly accurate, generalized whole-slide image analysis for improved cancer diagnosis and prognosis assessment. While these models show remarkable potential across cancer diagnostics and prognostics, their clinical translation faces critical challenges, including variability in the optimal model across cancer types, potential data leakage in evaluation, and a lack of standardized benchmarks. Without rigorous, unbiased evaluation, even the most advanced PFMs risk remaining confined to research settings, delaying their life-saving applications. Existing benchmarking efforts remain limited by narrow cancer-type focus, potential pretraining data overlaps, or incomplete task coverage. We present PathBench, the first comprehensive benchmark addressing these gaps through: multi-center in-house datasets spanning common cancers with rigorous leakage prevention, evaluation across the full clinical spectrum from diagnosis to prognosis, and an automated leaderboard system for continuous model assessment. Our framework incorporates large-scale data, enabling objective comparison of PFMs while reflecting real-world clinical complexity. All evaluation data come from private medical providers and are strictly excluded from any pretraining use to avoid data leakage risks. We have collected 15,888 WSIs from 8,549 patients across 10 hospitals, encompassing over 64 diagnosis and prognosis tasks. Currently, our evaluation of 19 PFMs shows that Virchow2 and H-Optimus-1 are the most effective models overall. This work provides researchers with a robust platform for model development and offers clinicians actionable insights into PFM performance across diverse clinical scenarios, ultimately accelerating the translation of these transformative technologies into routine pathology practice.

Paper Structure

This paper contains 23 sections, 9 figures, 26 tables.

Figures (9)

  • Figure 1: The workflow and overall results of PathBench. a. The data used for the benchmark. b. The workflow of evaluating foundation models. c. The evaluated foundation models. d. The average ranking score of evaluated models. f. The average performance on the pathological diagnosis, molecular diagnosis, and prognosis tasks.
  • Figure 2: Performance on lung cancer data. a-b. Performance on molecular classification tasks CK7 and TTF-1. c. Classification performance for primary LUAD versus metastatic lung cancer. d. Primary site prediction for metastatic lung cancer. e-f. Performance on molecular classification tasks C-MET and NapsinA. g. Overall ranking scores of foundation models on lung cancer data. The error bars indicate the standard deviation.
  • Figure 3: The overall results of foundation models on the breast cancer data. a. Overall survival analysis results. b. Disease-free survival analysis results. c. Molecular subtyping results. d. TNM N Staging prediction results. The error bars represent the standard deviation.
  • Figure 4: The overall results of foundation models on the breast cancer data. a-e. The molecular subtyping performance. f. pTNM Staging prediction results. g. The average ranking score of various foundation models on breast cancer tasks. The error bars represent the standard deviation.
  • Figure 5: Overall results of gastric cancer. a-b. The molecular subtyping results of HER-2 and S-100, respectively. c. Gastric cancer grading results. d. Lauren subtyping results. e. Perineural invasion detection results. The error bars represent the standard deviation.
  • ...and 4 more figures