A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model

Yingxue Xu, Yihui Wang, Fengtao Zhou, Jiabo Ma, Cheng Jin, Shu Yang, Jinbang Li, Zhengyu Zhang, Chenglong Zhao, Huajun Zhou, Zhenhui Li, Huangjing Lin, Xin Wang, Jiguang Wang, Anjia Han, Ronald Cheong Kin Chan, Li Liang, Xiuming Zhang, Hao Chen

TL;DR

A multimodal pathology foundation model, mSTAR, is developed to integrate whole-slide images (WSIs), pathology reports, and RNA-Seq data at the whole-slide level. The approach uses two-stage pretraining: Stage 1 builds a multimodal slide aggregator through slide-level contrastive learning, and Stage 2 performs Self-Taught Training to transfer multimodal knowledge to the patch extractor, so that patch representations carry whole-slide context. Evaluated across 97 practical oncological tasks, mSTAR consistently outperforms state-of-the-art baselines in pathological diagnosis, molecular prediction, vision-language tasks, survival prediction, and multimodal fusion, with strong external generalization and zero-shot capabilities. The work highlights modality scalability as a key principle for pathology foundation models, while acknowledging data scale and end-to-end pretraining as challenges for future work.
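
The slide-level contrastive learning of Stage 1 can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE loss in which the slide embedding serves as the anchor against the report and RNA-Seq embeddings; this is a common formulation and not necessarily the exact objective used in the paper. The function names, the temperature of 0.07, and the plain-list embeddings are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy of matching anchor i to target i within the batch."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    loss = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, target)) / temperature
                  for target in t]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        loss += log_denominator - logits[i]
    return loss / len(a)


def symmetric_contrastive_loss(x, y, temperature=0.07):
    """Average of the x-to-y and y-to-x InfoNCE terms."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def slide_level_loss(wsi, report, rna, temperature=0.07):
    """Assumed pairing: slide embeddings anchored against each other modality."""
    return (symmetric_contrastive_loss(wsi, report, temperature)
            + symmetric_contrastive_loss(wsi, rna, temperature))
```

In this sketch, row i of each input is the embedding of the same case in a different modality, so the loss is small when matching rows are the most similar pairs in the batch and large when they are not.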

Abstract

Remarkable strides have been made in computational pathology (CPath) with task-agnostic foundation models (FMs) that advance the performance of a wide array of downstream clinical tasks. Despite the promising performance, several challenges remain. First, prior works have resorted to either vision-only or image-caption data, disregarding both pathology reports, which carry more clinically authentic information from pathologists, and gene expression profiles; each of these offers distinct knowledge for versatile clinical applications. Second, current progress in pathology FMs predominantly concentrates on the patch level, where the restricted context of patch-level pretraining fails to capture whole-slide patterns. Even recent slide-level FMs still struggle to provide whole-slide context for patch representation. In this study, for the first time, we develop a pathology foundation model incorporating three modalities: pathology slides, pathology reports, and gene expression data. The resulting pretraining data comprise 26,169 slide-level modality pairs from 10,275 patients across 32 cancer types, amounting to over 116 million pathological patch images. To leverage these data for CPath, we propose a novel whole-slide pretraining paradigm, called Multimodal Self-TAught PRetraining (mSTAR), that injects multimodal whole-slide context into the patch representation. The proposed paradigm revolutionizes the pretraining workflow for CPath, enabling the pathology FM to acquire whole-slide context. To the best of our knowledge, this is the first attempt to incorporate three modalities in the whole-slide context for enhancing pathology FMs. To systematically evaluate the capabilities of mSTAR, we built the largest oncological benchmark, spanning 97 practical oncological tasks of 15 types across 7 categories of oncological applications.

Paper Structure (14 sections, 4 equations, 22 figures, 40 tables)


Figures (22)

  • Figure 1: Overview of the study. a, The clinical workflow for diagnosis, treatment and prognosis in oncology, which primarily involves three common data modalities: WSIs, pathology reports and gene expression profiles. b, Overview of the mSTAR paradigm. mSTAR consists of two stages: 1) Slide-level Contrastive Learning, and 2) Patch-level Self-Taught Training. c-e, Statistics of the data used in this study: c, Venn diagram of cases across modalities; d, the number of cases in the pretraining data for each cancer type; e, the distribution of word counts in pathology reports. f, The evaluation schemes used in this study: held-out, independent, external and zero-shot. The schemes are explained in Sec. \ref{sec:eval}. g, The distribution of datasets across task types for each evaluation scheme; detailed information about every dataset is presented in Extended Data Table \ref{tab:all_ds}. h, The average performance over 97 tasks of 15 types across 7 categories of applications: Pathological Diagnosis, Molecular Prediction, Report Generation, Survival Prediction, Multimodal Fusion, Zero-shot Slide Classification, and Zero-shot Slide Retrieval. Zero-shot tasks, which require a well-aligned vision-language space, are evaluated only for the vision-language models, i.e., PLIP, CONCH and mSTAR (see Extended Data Table \ref{tab:overall_performance}).
  • Figure 2: Overview of the mSTAR pipeline. mSTAR is a whole-slide pretraining paradigm comprising two stages. a, Stage 1 injects multimodal knowledge into a slide aggregator through slide-level contrastive learning among WSIs, pathology reports and gene expression data. b, Stage 2 propagates the multimodal knowledge learned at the slide level into the patch extractor through Self-Taught training, which uses the slide aggregator pretrained in Stage 1 as the "Teacher" and trains the patch extractor as the "Student".
  • Figure 3: Performance of Pathological Diagnosis on 21 datasets. a, The overall performance on pathological diagnosis. b, The performance on 8 independent datasets. c, The performance on 10 external datasets. The red lines and the values reported at the top of panels a, b and c give the performance averaged across datasets. Each point represents a dataset, with the size of the point indicating the standard deviation. d, The performance on 3 held-out datasets. e, Task distribution of pathological diagnosis across sites for the different evaluation schemes. f, The overall performance on Pathological Subtyping across 10 datasets. g, The performance on 6 external datasets of Pathological Subtyping. h-i, Visualization of attention scores from mSTAR on the h) CAMELYON and i) PANDA datasets. The P-value for each group of experiments is computed with a one-sided Wilcoxon signed-rank test between mSTAR and the second-best FM. * represents $P<0.05$, ** represents $P<0.01$ and *** represents $P<0.001$. Detailed results for every dataset are presented in Extended Data Fig. \ref{fig:dianosis-raw} and Tab. \ref{tab:diagnosis}.
  • Figure 4: Performance of Molecular Prediction on 40 datasets across 10 cancer types. a, Overall performance of Gene Mutation Prediction on 23 datasets. b, Performance of Mutation Prediction on 18 held-out datasets. c, Overall performance of Immunohistochemistry (IHC) Biomarker Prediction on 10 datasets. d, Performance of IHC Biomarker Prediction on 4 independent datasets. e, Overall performance of Molecular Subtyping on 7 datasets. f, Performance of Molecular Subtyping on 4 held-out datasets. The red lines and the values reported at the top of panels a-f give the performance averaged across datasets. Each point represents a dataset, with the size of the point indicating the standard deviation. g, Positive and negative ratios of gene mutation for every mutation dataset, with high-frequency mutated genes highlighted in green and genes related to FDA-approved therapies highlighted in red. h-j, Internal (In) vs. External (Ext) evaluation: h, Performance of Mutation Prediction on 5 internal and 5 external datasets. i, Performance of IHC Biomarker Prediction on 3 internal and 3 external datasets. j, Performance of Molecular Subtyping on 3 internal and 3 external datasets. The P-value for each group of experiments is computed with a one-sided Wilcoxon signed-rank test between mSTAR and the second-best FM. * represents $P<0.05$, ** represents $P<0.01$ and *** represents $P<0.001$. Detailed results for every dataset spanning 10 cancer types are presented in Extended Data Fig. \ref{fig:molecular-raw} and Tab. \ref{tab:mutation_abmil}-\ref{tab:molecular}.
  • Figure 5: Vision-language Evaluation. a, The scheme of zero-shot evaluation. For zero-shot classification, class prompts are used as the text input. For zero-shot retrieval, the text input is a pathology report. b, Performance of zero-shot slide classification on 6 independent datasets. 'Overall' refers to the performance averaged across these 6 datasets. The P-value is computed with a one-sided Wilcoxon signed-rank test between mSTAR and the second-best FM. c, Performance of zero-shot retrieval on an external dataset for Image-to-Text and Text-to-Image tasks. The results on the held-out TCGA dataset are presented only as a reference against which to compare zero-shot capability. d, Performance of report generation on one held-out TCGA dataset and two external datasets. The P-value for each group of experiments is computed with a one-sided Wilcoxon signed-rank test between mSTAR and the second-best FM. Detailed results for every dataset are presented in Extended Data Tab. \ref{tab:zeroshot_overall}-\ref{tab:report_gen}.
  • ...and 17 more figures
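
The captions above report P-values from a one-sided Wilcoxon signed-rank test between mSTAR and the second-best FM. The sketch below shows how such a test can be computed on paired per-dataset scores. It assumes the exact variant of the test (enumerating all sign assignments), with zero differences dropped and tied magnitudes given their mean rank; the source does not state which variant or software the authors used. The function names are hypothetical, and the enumeration is only practical for a modest number of datasets.

```python
from itertools import product


def average_ranks(values):
    """Rank values in ascending order (1-based); ties receive their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def wilcoxon_one_sided(model_scores, baseline_scores):
    """Exact P(W+ >= observed) under H0; the alternative is model > baseline."""
    # Paired differences, with zero differences dropped (an assumed convention).
    diffs = [m - b for m, b in zip(model_scores, baseline_scores) if m != b]
    if not diffs:
        return 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 every difference is equally likely to be positive or negative,
    # so all 2^n sign assignments are enumerated with equal weight.
    at_least_as_extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        if w_plus >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / 2 ** len(ranks)
```

Under this exact formulation, a model that beats the baseline on all of n datasets with untied differences obtains P = 1/2^n, so at least five datasets are needed before the one-sided P-value can fall below 0.05.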