PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology

George Shaikovski, Adam Casson, Kristen Severson, Eric Zimmermann, Yi Kan Wang, Jeremy D. Kunz, Juan A. Retamero, Gerard Oakley, David Klimstra, Christopher Kanan, Matthew Hanna, Michal Zelechowski, Julian Viret, Neil Tenenholtz, James Hall, Nicolo Fusi, Razik Yousfi, Peter Hamilton, William A. Moye, Eugene Vorontsov, Siqi Liu, Thomas J. Fuchs

TL;DR

PRISM presents a slide-level foundation approach for histopathology by integrating Virchow tile embeddings with WSI-level clinical report supervision via a memory-efficient Perceiver encoder and a BioGPT language decoder. The model enables zero-shot disease detection, cancer sub-typing, and text-based report generation, while also supporting label-efficient biomarker prediction through fine-tuning. Across extensive WSI-scale data, PRISM achieves competitive zero-shot and superior data-efficiency performance compared to fully supervised aggregators, and offers interpretable predictions via attended tiles and generated reports. The work demonstrates a scalable, multi-modal framework that aligns image content with clinical narratives, potentially accelerating translation of slide-level decision support across pathology practice.

Abstract

Foundation models in computational pathology promise to unlock the development of new clinical decision support systems and models for precision medicine. However, there is a mismatch between most clinical analysis, which is defined at the level of one or more whole slide images, and foundation models to date, which process the thousands of image tiles contained in a whole slide image separately. The requirement to train a network to aggregate information across a large number of tiles in multiple whole slide images limits these models' impact. In this work, we present a slide-level foundation model for H&E-stained histopathology, PRISM, that builds on Virchow tile embeddings and leverages clinical report text for pre-training. Using the tile embeddings, PRISM produces slide-level embeddings with the ability to generate clinical reports, resulting in several modes of use. Using text prompts, PRISM achieves zero-shot cancer detection and sub-typing performance approaching and surpassing that of a supervised aggregator model. Using the slide embeddings with linear classifiers, PRISM surpasses supervised aggregator models. Furthermore, we demonstrate that fine-tuning of the PRISM slide encoder yields label-efficient training for biomarker prediction, a task that typically suffers from low availability of training data; an aggregator initialized with PRISM and trained on as little as 10% of the training data can outperform a supervised baseline that uses all of the data.
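
The zero-shot mode described above can be illustrated with a minimal sketch. It assumes a CLIP-style scoring rule: cosine similarity between a slide embedding and one text-prompt embedding per class, followed by a softmax. The abstract does not state PRISM's exact scoring rule, so the function names, the temperature, and the toy three-dimensional vectors below are hypothetical placeholders rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probs(slide_emb, prompt_embs, temperature=1.0):
    """Softmax over slide-to-prompt cosine similarities (assumed scoring rule)."""
    logits = [cosine(slide_emb, p) / temperature for p in prompt_embs.values()]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(prompt_embs, exps)}


# Toy vectors standing in for real slide and prompt embeddings (hypothetical).
slide = [0.9, 0.1, 0.0]
prompts = {"cancer": [1.0, 0.0, 0.0], "benign": [0.0, 1.0, 0.0]}

probs = zero_shot_probs(slide, prompts, temperature=0.1)
prediction = max(probs, key=probs.get)
```

No classifier is trained in this mode: changing the task only requires a different set of prompts, which is why the abstract calls it zero-shot.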

Paper Structure (24 sections, 4 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: An overview of the capabilities enabled by the slide-level foundation model (PRISM), built on Virchow (Vorontsov et al., 2024) tile embeddings. Whereas Virchow produces an embedding for each foreground tile of a set of whole slide images, PRISM aggregates these embeddings into a single slide embedding that can be used for image perception by training a linear classifier for downstream tasks including cancer detection, cancer sub-typing, and biomarker detection. Optionally, the model can be fine-tuned for the classification task. Language-enabled capabilities of PRISM include training-free "zero-shot" prediction via text prompting and generation of interpretable free-text clinical reports.
  • Figure 2: The training methodology for the slide-level foundation model (PRISM). All trained weights are initialized to random values except for the BioGPT word embeddings. Whole slide images and clinical report latent embeddings are aligned with a contrastive loss. Report generation is trained with a generative loss using teacher forcing. Layers 13-24 of the BioGPT decoder are modified to cross-attend to vision embeddings.
  • Figure 3: Statistics on the specimen-level pre-training dataset for PRISM. Note that a specimen may contain one or more WSIs. a. Distribution of specimens by the site of tissue origin. b. Proportion of data with the most severe diagnosis being cancer, precursor to cancer, or benign. Note, for example, that a specimen with cancer may also have a precursor to cancer. c. The histogram of tile counts per specimen. 85% of the specimens (195,344 specimens) have fewer than 100 thousand tiles (plots a and b describe this subset).
  • Figure 4: t-SNE plots of slide embeddings for cancer sub-typing datasets. IDC and ILC are types of breast cancer. LUSC and LUAD are types of non-small cell lung cancer. DCIS is an early-stage non-invasive breast cancer and a precursor to IDC; some IDC slides can contain DCIS regions, but IDC takes precedence in the diagnosis as the higher-stage cancer. All three plots suggest that slide embeddings form distinct clusters in the higher-dimensional space along the cancer sub-type labels.
  • Figure 5: Fine-tuning pre-trained slide encoder (PRISM Perceiver) improves data efficiency compared to training the same model from scratch. The mean (standard deviation) performance across 3 experimental runs is plotted as a solid line (shaded region), relative to the highest AUROC achieved without pre-training. The subset fraction denotes the fraction of the available training data used to fine-tune or train the model. A different random subset is selected with each experimental run. The vertical dashed line (magenta) denotes the minimal subset fraction required when fine-tuning PRISM to reach at least 99.5% of the AUROC that can be achieved without pre-training. Note that in many cases, pre-training yields better performance than can be achieved when training from scratch.
  • ...and 3 more figures
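
The criterion behind the vertical dashed line in Figure 5 can be sketched as follows: find the smallest training-subset fraction at which the fine-tuned model reaches at least 99.5% of the best AUROC achieved without pre-training. The function name and the AUROC values are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
def minimal_subset_fraction(finetuned_auroc, scratch_best_auroc, ratio=0.995):
    """Smallest fraction whose fine-tuned AUROC reaches ratio * scratch_best_auroc.

    finetuned_auroc maps subset fraction -> mean AUROC across runs.
    Returns None if no fraction reaches the target.
    """
    target = ratio * scratch_best_auroc
    for fraction in sorted(finetuned_auroc):
        if finetuned_auroc[fraction] >= target:
            return fraction
    return None


# Placeholder values for illustration only, NOT taken from the paper.
curve = {0.1: 0.80, 0.25: 0.86, 0.5: 0.90, 1.0: 0.92}
scratch_best = 0.90  # highest AUROC when training from scratch (placeholder)

frac = minimal_subset_fraction(curve, scratch_best)
```

With these placeholder values the target is 0.995 × 0.90 = 0.8955, which is first reached at a subset fraction of 0.5.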