Multimodal Whole Slide Foundation Model for Pathology

Tong Ding, Sophia J. Wagner, Andrew H. Song, Richard J. Chen, Ming Y. Lu, Andrew Zhang, Anurag J. Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, Drew F. K. Williamson, Bowen Chen, Cristina Almagro-Perez, Paul Doucet, Sharifa Sahai, Chengkuan Chen, Daisuke Komura, Akihiro Kawabe, Shumpei Ishikawa, Georg Gerber, Tingying Peng, Long Phi Le, Faisal Mahmood

TL;DR

We introduce TITAN, a Transformer-based multimodal whole-slide foundation model for pathology that yields general-purpose slide embeddings and cross-modal capabilities. TITAN is pretrained on 335,645 WSIs in three stages: unimodal vision pretraining, followed by vision-language alignment with 423,122 synthetic ROI captions (PathChat) and then with 182,862 slide-level medical reports, so that both ROI and slide representations are aligned with text. Its slide encoder is a ViT with 2D ALiBi, which enables long-context extrapolation from ROI blocks to full WSIs. Across a broad suite of tasks (morphological subtyping, molecular classification, survival prediction, rare-cancer retrieval, cross-modal retrieval, and pathology report generation), TITAN outperforms ROI- and slide-based foundation models and shows strong zero-shot and few-shot performance.
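The 2D ALiBi idea mentioned above can be illustrated with a minimal sketch. It assumes that the attention bias between two patch tokens is the negative Euclidean distance between their grid coordinates, scaled by a per-head slope. The slope schedule is borrowed from the original 1D ALiBi recipe, and the function names `alibi_slopes` and `alibi_2d_bias` are hypothetical; none of this is taken from the TITAN implementation.

```python
import math


def alibi_slopes(num_heads):
    # Assumption: geometric slope schedule from 1D ALiBi,
    # valid when num_heads is a power of two.
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_2d_bias(grid_h, grid_w, num_heads):
    # Row-major list of (row, col) coordinates for every patch token.
    coords = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    bias = []
    for slope in alibi_slopes(num_heads):
        # bias[head][i][j] = -slope * Euclidean distance between tokens i and j.
        head = [
            [-slope * math.hypot(r1 - r2, c1 - c2) for (r2, c2) in coords]
            for (r1, c1) in coords
        ]
        bias.append(head)
    return bias


# Example: a 2x3 grid of patch features and 4 attention heads.
bias = alibi_2d_bias(2, 3, 4)
print(len(bias), len(bias[0]), len(bias[0][0]))
```

Because the bias depends only on relative distance, the same function can in principle be evaluated on a larger grid at inference time than was used in training, which is the property that makes this family of encodings attractive for extrapolating to full slides. The bias would be added to the attention logits before the softmax.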

Abstract

The field of computational pathology has been transformed by recent advances in foundation models that encode histopathology regions of interest (ROIs) into versatile and transferable feature representations via self-supervised learning (SSL). However, translating these advancements to address complex clinical challenges at the patient and slide level remains constrained by limited clinical data in disease-specific cohorts, especially for rare clinical conditions. We propose TITAN, a multimodal whole slide foundation model pretrained using 335,645 WSIs via visual self-supervised learning and vision-language alignment with corresponding pathology reports and 423,122 synthetic captions generated from a multimodal generative AI copilot for pathology. Without any finetuning or clinical labels, TITAN can extract general-purpose slide representations and generate pathology reports that generalize to resource-limited clinical scenarios such as rare disease retrieval and cancer prognosis. We evaluate TITAN on diverse clinical tasks and find that it outperforms both ROI and slide foundation models across machine learning settings such as linear probing, few-shot and zero-shot classification, rare cancer retrieval, cross-modal retrieval, and pathology report generation.

Paper Structure

This paper contains 1 section, 2 equations, 10 figures, 139 tables.

Table of Contents

  1. Ethics Statement

Figures (10)

  • Figure 1: Overview of $\textsc{TITAN}$. (a) Tissue site distribution of Mass-340K used for $\textsc{TITAN}_{\text{V}}$ pretraining (Stage 1). Mass-340K includes 335,645 WSIs across 20 organs, with a mix of hematoxylin-and-eosin-stained (90.9%) and immunohistochemistry-stained (9.1%) tissue sections and a mix of neoplastic (70.0%) and non-neoplastic (30.0%) tissue sections. $\textsc{TITAN}$ pretraining (Stages 2 and 3) uses a subset of Mass-340K with paired captions and medical reports. (b--d) Block diagram of $\textsc{TITAN}_{\text{V}}$ pretraining. (b) $\textsc{TITAN}$ uses a Vision Transformer to encode a WSI into a slide embedding. (c) $\textsc{TITAN}_{\text{V}}$ (Stage 1) is pretrained using self-supervised learning with student--teacher knowledge distillation. (d) $\textsc{TITAN}$ (Stages 2 and 3) is pretrained using vision-language modeling, first by aligning the slide embedding with synthetic captions (Stage 2) and then with medical reports (Stage 3). (e) UMAP visualization of TCGA slide embeddings obtained with $\textsc{TITAN}$, color-coded by organ. UMAP: uniform manifold approximation and projection.
  • Figure 1: Examples from the TCGA-UT-8K dataset. The examples are ROIs of $8{,}192\times 8{,}192$ pixels selected by pathologists. The green contours illustrate the cancer region annotations, with the red number indicating the ROI index within a given TCGA slide.
  • Figure 2: $\textsc{TITAN}$ evaluation. (a) Impact of pretraining data size on $\textsc{TITAN}_{\text{V}}$ and baselines across four challenging subtyping tasks (TCGA-UT-8K, TCGA-OT, OT108 and EBRAINS). $\textsc{TITAN}_{\text{V}}$ is pretrained with 12.5%, 25%, 50%, and 100% of Mass-340K. (b) The average performance on the four tasks against the number of parameters for each baseline. (c) Linear probe evaluation of $\textsc{TITAN}$ and baselines on morphological classification (all and challenging subset), molecular status, and survival prediction tasks. The mean-pooling baseline uses the same patch encoder as $\textsc{TITAN}$ (CONCHv1.5). Multi-class tasks are evaluated with balanced accuracy, binary tasks with AUROC, and survival tasks with concordance index. For external cohorts (DHMC, CPTAC), the classifier is trained on the corresponding TCGA cohort. All error bars represent standard deviations based on bootstrapping. (d) Ablation study comparing the impact of positional encoding, number of Transformer layers, and inclusion of the vision-pretraining stage. The performance is averaged across the four subtyping tasks. (e) Change in performance of $\textsc{TITAN}$ and baselines averaged across the four subtyping tasks for different learning paradigms. For mean pooling and ABMIL, the respective patch encoder for each framework is used. (f) Linear probe few-shot performance $@K$ shots, with $K\in\{1,2,4,8,16\}$, comparing baselines and ABMIL with CONCHv1.5. For each setting, 50 runs were performed. Whiskers extend to data points within 1.5$\times$ the interquartile range. C: number of classes. Ft.: finetune. ABMIL: attention-based multiple instance learning.
  • Figure 2: Linear probe results for molecular classification tasks. (a) Linear models are fitted and evaluated on binary molecular status predictions for BCNB and MUT-HET. (b) Linear models are fitted and evaluated on five fold-splits on TCGA. (c) The same models are evaluated on the corresponding external datasets from CPTAC and EBRAINS. (d) 6-level ER and PR prediction from immunohistochemistry (IHC) slides from Mass General Hospital (MGH). (e) Molecular classification tasks for BRCA and LUAD from Mass General Brigham (MGB).
  • Figure 3: Vision-language evaluation of $\textsc{TITAN}$. (a) A schematic for zero-shot evaluation. The query slide is classified by identifying the closest text prompt embedding in the slide embedding space. (b) Zero-shot performance of $\textsc{TITAN}$ and PRISM. All multi-class tasks are evaluated with balanced accuracy and binary tasks are evaluated with AUROC. All error bars represent standard deviations based on bootstrapping. (c) Ablation study comparing different pretraining strategies, assessed with zero-shot performance averaged across TCGA-UT-8K, TCGA-OT, OT108, and EBRAINS. Results are reported as the percentage change in balanced accuracy relative to the reference zero-shot performance of $\textsc{TITAN}$. (d) Report generation on TCGA-Slide-Reports, evaluated using METEOR, ROUGE, and BLEU. (e) TCGA examples of reports generated by $\textsc{TITAN}$ and PRISM, with the corresponding clinical reports. Additional examples of generated reports are available in Extended Data \ref{fig:edf_reports}. C: number of classes.
  • ...and 5 more figures
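
The zero-shot procedure described in panel (a) of Figure 3, where a query slide is assigned to the class whose text prompt embedding is closest, can be sketched as follows. This is a toy illustration under stated assumptions: cosine similarity is assumed as the closeness measure, the embeddings are made-up 2D vectors rather than outputs of any real encoder, and the names `cosine` and `zero_shot_classify` are hypothetical.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(slide_embedding, prompt_embeddings):
    # prompt_embeddings maps each class label to the embedding of its text prompt.
    scores = {
        label: cosine(slide_embedding, emb)
        for label, emb in prompt_embeddings.items()
    }
    # Predict the label whose prompt embedding is most similar to the slide.
    return max(scores, key=scores.get), scores


# Made-up embeddings for two classes and one query slide.
prompts = {"LUAD": [1.0, 0.0], "LUSC": [0.0, 1.0]}
label, scores = zero_shot_classify([0.9, 0.1], prompts)
print(label)
```

No classifier is trained in this setup. The only task-specific inputs are the text prompts, which is why the approach can be applied when labeled slides are unavailable.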