Towards A Generalizable Pathology Foundation Model via Unified Knowledge Distillation

Jiabo Ma, Zhengrui Guo, Fengtao Zhou, Yihui Wang, Yingxue Xu, Jinbang Li, Fang Yan, Yu Cai, Zhengjie Zhu, Cheng Jin, Yi Lin, Xinrui Jiang, Chenglong Zhao, Danyi Li, Anjia Han, Zhenhui Li, Ronald Cheong Kin Chan, Jiguang Wang, Peng Fei, Kwang-Ting Cheng, Shaoting Zhang, Li Liang, Hao Chen

TL;DR

This work introduces GPFM, a Generalizable Pathology Foundation Model, trained with a unified knowledge distillation framework that combines self-distillation with expert knowledge distillation from multiple pathology FMs. Using a large, diverse pretraining corpus of 95,572 WSIs (34 tissue types) and 190 million patches, GPFM is evaluated on 72 tasks spanning six clinical categories: WSI classification, survival analysis, ROI classification, image retrieval, VQA, and report generation. GPFM achieves state-of-the-art generalization, with an average task rank of 1.6 and first place on 42 tasks, outperforming existing models such as UNI; ablations confirm the value of expert distillation. The approach demonstrates practical potential for a versatile, data-efficient pathology AI system, enabling robust transfer across diverse clinical tasks while preserving privacy by distilling knowledge from multiple existing models without requiring access to their original data.

Abstract

Foundation models pretrained on large-scale datasets are revolutionizing the field of computational pathology (CPath). The generalization ability of foundation models is crucial for success in various downstream clinical tasks. However, current foundation models have only been evaluated on a limited number and variety of tasks, leaving their generalization ability and overall performance unclear. To address this gap, we established the most comprehensive benchmark to date to evaluate the performance of off-the-shelf foundation models across six distinct clinical task types, encompassing a total of 72 specific tasks: slide-level classification, survival prediction, ROI-tissue classification, ROI retrieval, visual question answering, and report generation. Our findings reveal that existing foundation models excel at certain task types but struggle to handle the full breadth of clinical tasks. To improve the generalization of pathology foundation models, we propose a unified knowledge distillation framework consisting of both expert and self-knowledge distillation, where the former allows the model to learn from the knowledge of multiple expert models, while the latter leverages self-distillation to enable image representation learning via local-global alignment. Based on this framework, we curated a dataset of 96,000 whole slide images (WSIs) and developed a Generalizable Pathology Foundation Model (GPFM). The model was trained on 190 million images extracted from approximately 72,000 publicly available slides, encompassing 34 major tissue types. Evaluated on the established benchmark, GPFM achieves an average rank of 1.6, with 42 tasks ranked 1st, while the second-best model, UNI, attains an average rank of 3.7, with only 6 tasks ranked 1st.
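To make the two distillation terms in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that self-distillation is a cross-entropy between a teacher's distribution on a global view and the student's distribution on a local view, and that expert distillation is an average cosine distance between the student feature and each expert feature. The function names, the temperatures, and the `expert_weight` parameter are all hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(teacher_probs, student_probs, eps=1e-12):
    """Cross-entropy H(teacher, student); smallest when the two agree."""
    return -sum(t * math.log(s + eps) for t, s in zip(teacher_probs, student_probs))


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def unified_distillation_loss(
    student_local_logits,   # student output on a local crop (assumed input)
    teacher_global_logits,  # teacher output on a global crop (assumed input)
    student_feature,        # student embedding of the image
    expert_features,        # one embedding per expert model
    expert_weight=1.0,      # hypothetical weighting between the two terms
    student_temp=0.1,       # hypothetical temperature
    teacher_temp=0.04,      # hypothetical temperature
):
    # Self-distillation: align the student's local view with the teacher's
    # global view (the "local-global alignment" named in the abstract).
    teacher_probs = softmax(teacher_global_logits, teacher_temp)
    student_probs = softmax(student_local_logits, student_temp)
    self_loss = cross_entropy(teacher_probs, student_probs)

    # Expert distillation: pull the student feature towards every expert.
    expert_loss = sum(
        cosine_distance(student_feature, f) for f in expert_features
    ) / len(expert_features)

    return self_loss + expert_weight * expert_loss
```

In this sketch the loss shrinks when the student's local-view prediction matches the teacher's global-view prediction and when the student feature points in the same direction as the expert features. A real pipeline would compute these terms on batched tensors and may weight or define them differently.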

Paper Structure (24 sections, 15 figures, 52 tables)

Figures (15)

  • Figure 1: Overview of the GPFM. GPFM is a state-of-the-art pretrained FM that demonstrates exceptional performance across 72 diverse tasks. a. The GPFM dataset comprises a large-scale collection of 95,572 slides spanning 34 major tissue types, enabling comprehensive model training and evaluation. b-c. Performance evaluation of foundation models (FMs) across a diverse set of tasks: 52 internal tasks and 20 external tasks. Only the top 4 models are presented here. For a more comprehensive analysis, including additional FMs, please refer to Fig. \ref{fig:overall_fig}. d. Overview of unified knowledge distillation for GPFM. The experts used for Expert Knowledge Distillation are selected based on their average performance on six different clinical tasks. The pretraining algorithm includes three key components: 1) Masked Image Modeling (MIM), 2) Self-Distillation, and 3) Expert Knowledge Distillation. The parameters of GPFM are updated through Exponential Moving Average (EMA).
  • Figure 2: Comprehensive Comparison of FMs across 72 Tasks. a. Task types evaluated by different FMs. b. Average performance of FMs across 72 tasks: WSI classification and tissue classification tasks are measured by AUC; survival analysis tasks are measured by C-index; the VQA task is measured by overall accuracy; the report generation task is measured by the average of BLEU, METEOR, and ROUGE-L; the image retrieval task is measured by average accuracy. The two-sided Wilcoxon signed-rank test is employed to detect significant differences between off-the-shelf FMs and the proposed GPFM. The error bars in b and c indicate the 95% CI. The figure demonstrates that GPFM achieved the highest average performance. c. Average rank of FMs across 72 downstream tasks. The box limits represent the standard error. d. Critical difference (CD) diagram of average ranking score with the Nemenyi test. In the CD figure, there are no significant differences between the models covered by the black line. e-f. Ranking order of FMs across 32 and 20 internal tasks, respectively. g. Ranking order of FMs on 20 external validation datasets. If a model achieves the best performance, its rank value is set to 1. If two models have the same metric value, indicating a tie, the average rank value is assigned to all the tied models. For WSI-VQA, the rank is determined by the average of linguistic evaluation metrics and closed accuracy. The evaluation metrics used to derive the ranking scores for the remaining tasks are consistent with those applied in subfigure b.
  • Figure 3: Performance of FMs on WSI Classification Tasks. a. Average ranking of FMs based on AUC across 36 WSI classification tasks. b-d. Average balanced accuracy (ACC), weighted F1 score (F1), and AUC of FMs across 36 WSI classification tasks. e. Average AUC of FMs on 20 internal WSI classification tasks. f. Average AUC of FMs on 16 external validation cohorts. g-h. Model performance on specific tasks: RCC subtyping, vascular invasion detection, ovarian cancer subtyping, and breast carcinoma subtyping. * represents external validation cohorts. Error bars represent 95% CI. The box limits represent the standard error. Additional results are shown in Extended Data Fig. \ref{fig:WSI_ext1} and Fig. \ref{fig:WSI_ext2}.
  • Figure 4: Performance of FMs across 15 Survival Analysis Tasks. a. Average ranking of FMs in 15 survival analysis tasks. b. Average C-Index of various FMs across 15 tasks. c. Results on TCGA-HNSC data and the HANCOCK cohort. The survival prediction model was trained on the TCGA-HNSC cohort and subsequently tested on the HANCOCK cohort. d-f. C-Index of FMs across 12 survival analysis tasks. In all subfigures, error bars indicate 95% CI. For box plots, the center line represents the mean, and the box limits represent the standard error.
  • Figure 5: Performance of FMs on Tissue Classification Tasks. a. Average ranking order of FMs based on AUC across 16 tasks. b-d. Average balanced accuracy (ACC), weighted F1 score (F1), and AUC of FMs across 16 tasks. The center line represents the mean and the box limits represent the standard error. e-i. AUC of FMs across 5 tissue classification tasks. The one-sided Wilcoxon signed-rank test is adopted to detect significant differences. The center black line in the violin plot represents the mean AUC. j. Tumor infiltrating lymphocytes classification based on the PanCancer-TIL (internal) and Center-3-TIL data (external). k. Gastric cancer tissue classification with GasHisDB (internal) and Center-3-GC data (external). In all subfigures, the error bars indicate 95% CI. More results are presented in Extended Data Fig. \ref{fig:extra_roi_results}.
  • ...and 10 more figures
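
The ranking rule stated in the Figure 2 caption (the best model gets rank 1, and tied models share the average of the ranks they occupy) can be sketched as below. This is an illustrative sketch only: the function names, the model names, and the scores in the usage note are hypothetical and not taken from the paper, and higher metric values are assumed to be better.

```python
def rank_with_ties(scores):
    """Rank models on one task; tied models share the average rank.

    `scores` maps a model name to its metric value (higher is better).
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    start = 0
    while start < len(ordered):
        # Extend `end` over every model tied with the one at `start`.
        end = start
        while end + 1 < len(ordered) and ordered[end + 1][1] == ordered[start][1]:
            end += 1
        # Positions are 1-based, so the tied group spans start+1 .. end+1.
        shared_rank = ((start + 1) + (end + 1)) / 2
        for position in range(start, end + 1):
            ranks[ordered[position][0]] = shared_rank
        start = end + 1
    return ranks


def average_rank(per_task_scores):
    """Average each model's tie-aware rank over a list of tasks."""
    collected = {}
    for scores in per_task_scores:
        for model, rank in rank_with_ties(scores).items():
            collected.setdefault(model, []).append(rank)
    return {model: sum(r) / len(r) for model, r in collected.items()}
```

For example, with made-up scores `{"model_a": 0.9, "model_b": 0.9, "model_c": 0.8}`, the two tied models each receive rank 1.5 and the third receives rank 3. Averaging these per-task ranks over all tasks gives the kind of average-rank score reported for each FM.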