Multimodal Whole Slide Foundation Model for Pathology
Tong Ding, Sophia J. Wagner, Andrew H. Song, Richard J. Chen, Ming Y. Lu, Andrew Zhang, Anurag J. Vaidya, Guillaume Jaume, Muhammad Shaban, Ahrong Kim, Drew F. K. Williamson, Bowen Chen, Cristina Almagro-Perez, Paul Doucet, Sharifa Sahai, Chengkuan Chen, Daisuke Komura, Akihiro Kawabe, Shumpei Ishikawa, Georg Gerber, Tingying Peng, Long Phi Le, Faisal Mahmood
TL;DR
We introduce TITAN, a Transformer-based multimodal whole-slide foundation model for pathology that produces general-purpose slide embeddings and supports cross-modal tasks. TITAN is pretrained on 335,645 whole-slide images (WSIs) and 182,862 medical reports, and additionally uses 423,122 synthetic ROI captions (PathChat) in a three-stage training scheme: unimodal vision pretraining, followed by ROI-caption alignment and slide-report alignment, which tie ROI and slide representations to text. Its slide encoder is a ViT with 2D ALiBi, which enables long-context extrapolation from ROI blocks to full WSIs. Across a broad suite of tasks, including morphological subtyping, molecular classification, survival prediction, rare-cancer retrieval, cross-modal retrieval, and pathology report generation, TITAN outperforms ROI- and slide-based foundation models and shows strong zero-shot and few-shot performance.
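To make the 2D ALiBi idea concrete, here is a minimal sketch of a distance-based attention bias over a grid of patch features. It is an illustration, not the authors' implementation: the function names `alibi_slopes` and `alibi_2d_bias` are hypothetical, the use of Euclidean distance between grid positions is an assumption, and the per-head slopes follow the geometric schedule of the original 1D ALiBi formulation.

```python
import math


def alibi_slopes(num_heads):
    # Geometric per-head slopes from 1D ALiBi: 2^(-8 * (h + 1) / num_heads).
    # Assumption: num_heads is a power of two, as in the simplest ALiBi setup.
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_2d_bias(rows, cols, num_heads):
    """Return bias[head][query][key] for a rows x cols grid of patch tokens.

    Each entry is -slope * distance, where distance is the (assumed) Euclidean
    distance between the grid positions of the query and key patches.
    """
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    bias = []
    for slope in alibi_slopes(num_heads):
        head = [
            [-slope * math.hypot(r1 - r2, c1 - c2) for (r2, c2) in coords]
            for (r1, c1) in coords
        ]
        bias.append(head)
    return bias


# The same function serves a small training crop and a much larger grid,
# because the bias depends only on relative positions, not on grid size.
small = alibi_2d_bias(2, 3, num_heads=4)
large = alibi_2d_bias(8, 8, num_heads=4)
```

The bias would be added to the attention logits before the softmax. Since nothing in it is tied to a fixed sequence length, a model trained on small grids can in principle be applied to larger ones, which is the extrapolation property the summary refers to.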
Abstract
The field of computational pathology has been transformed by recent advances in foundation models that encode histopathology regions of interest (ROIs) into versatile and transferable feature representations via self-supervised learning (SSL). However, translating these advances to complex clinical challenges at the patient and slide level remains constrained by limited clinical data in disease-specific cohorts, especially for rare clinical conditions. We propose TITAN, a multimodal whole-slide foundation model pretrained on 335,645 whole-slide images (WSIs) via visual self-supervised learning and vision-language alignment with corresponding pathology reports and 423,122 synthetic captions generated by a multimodal generative AI copilot for pathology. Without any fine-tuning or clinical labels, TITAN can extract general-purpose slide representations and generate pathology reports that generalize to resource-limited clinical scenarios such as rare disease retrieval and cancer prognosis. We evaluate TITAN on diverse clinical tasks and find that it outperforms both ROI and slide foundation models across linear probing, few-shot and zero-shot classification, rare cancer retrieval, cross-modal retrieval, and pathology report generation.
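As an illustration of how zero-shot classification can work once slide and text embeddings share a space, the sketch below scores a slide embedding against one text-prompt embedding per class using cosine similarity. This is a generic pattern for vision-language models, not TITAN's actual API: the function names are hypothetical and the three-dimensional vectors are toy values standing in for real encoder outputs.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(slide_emb, class_text_embs):
    """Pick the class whose text embedding is most similar to the slide.

    class_text_embs maps a class label to the embedding of a text prompt
    describing that class. No labeled training slides are involved.
    """
    scores = {label: cosine(slide_emb, emb) for label, emb in class_text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (made up for illustration only).
slide_embedding = [0.9, 0.1, 0.0]
prompts = {
    "lung adenocarcinoma": [1.0, 0.0, 0.0],
    "lung squamous cell carcinoma": [0.0, 1.0, 0.0],
}
predicted, similarity = zero_shot_classify(slide_embedding, prompts)
```

Cross-modal retrieval follows the same pattern: rank a collection of report embeddings by similarity to a slide embedding, or the reverse.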
