PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue

Eugene Vorontsov, George Shaikovski, Adam Casson, Julian Viret, Eric Zimmermann, Neil Tenenholtz, Yi Kan Wang, Jan H. Bernhard, Ran A. Godrich, Juan A. Retamero, Jinru Shia, Mithat Gonen, Martin R. Weiser, David S. Klimstra, Razik Yousfi, Nicolo Fusi, Thomas J. Fuchs, Kristen Severson, Siqi Liu

TL;DR

PRISM2 addresses the need for slide-level, generalizable pathology representations by aligning histomorphology with diagnostic language through clinical-dialogue supervision. It introduces a slide-level multimodal foundation model that yields two embedding types (base and diagnostic) via a two-stage training pipeline integrating a perceiver-based slide encoder, BioGPT text encoder, and Phi-3 Mini LLM, trained on a large corpus of specimen-report pairs and QA data. The approach achieves clinical-grade cancer detection with direct QA without task-specific fine-tuning, and demonstrates strong transfer to biomarker and survival tasks, plus the ability to complete CAP-style pathology reports. Overall, the work shows language-guided pretraining as a scalable, clinically grounded signal that bridges human diagnostic reasoning and foundation-model performance, with potential to enhance diagnostic workflows and prognostic assessments in pathology.

Abstract

Recent rapid progress in the field of computational pathology has been enabled by foundation models. These models are beginning to move beyond encoding image patches towards whole-slide understanding, but their clinical utility remains limited. In this work, we present PRISM2, a multimodal slide-level foundation model trained on data from 700,000 diagnostic specimen-report pairs, the largest vision (2.3 million whole slide images) and language (14M question-answer pairs) histopathology dataset to date. By learning through clinical-dialogue supervision, PRISM2 aligns histomorphologic features with the language of diagnostic reasoning, producing slide-level representations that support both direct diagnostic question-answering and transferable embeddings for downstream tasks. Without additional training, PRISM2 matches or exceeds the cancer-detection performance of clinical-grade products. This is observed without loss of generality on other tasks, where PRISM2 achieves top performance. Finally, using survival prediction as an example, we show that task-specific finetuning with a large dataset can outperform task-specific models, further improving performance. These results demonstrate how language-supervised pretraining provides a scalable, clinically grounded signal for learning generalizable pathology representations, bridging human diagnostic reasoning and foundation-model performance.

Paper Structure

This paper contains 23 sections, 5 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Overview of PRISM2, a slide-level foundation model capable of producing generalizable embeddings as well as direct prediction with clinical dialogue. (a) The input to PRISM2 is one or more WSIs associated with a patient case. These large tissue images are cropped into tiles, which are subsequently processed with the Virchow2 foundation model. The sequence of tile embeddings is then aggregated via a slide encoder into a base image representation ("Base Embedding"). The base embedding is then used as input, along with text, to a large vision-language model, which produces a diagnostic representation ("Diagnostic Embedding") and text. The diagnostic embedding is tailored to diagnostic tasks, such as the detection and identification of cancers, precursors to cancers, and benign conditions. Base embeddings are more general and are well suited for transfer learning to tasks outside of the diagnosis-focused training distribution, such as biomarker and survival prediction. (b) Examples of the four types of direct dialogue-based prediction templates: report generation, open-ended QA, yes-no QA, and multiple-choice QA. Yes-no and multiple-choice QA in particular enable directly querying the model for quantifiable predictions with probabilities that can be calibrated.
  • Figure 1: The architecture and training schematic for PRISM2. Sets of whole slide images are summarized as a base embedding (Attention Pooler class token) which is aligned to the representation of the clinical report diagnostic summary (encoded by the language decoder, BioGPT) by minimizing a contrastive loss. The image latent features are also passed to a pretrained large language model (LLM, Phi-3 Mini) along with a prompt from one of four dialogue templates, with the model's response updated by minimizing the dialogue loss. Training proceeds in two stages: in stage 1, only the slide encoder, attention pooler, language decoder, and LLM adapter weights are updated; in stage 2, only the LLM adapter and LLM weights are updated. For use in downstream tasks, the latent image features produced by the slide encoder are summarized in two embeddings: (1) the features are pooled into a 'base embedding', used for contrastive alignment; and (2) the features are refined by the LLM into a 'diagnostic embedding', taken from the LLM hidden state at the <|assistant|> token, after inputting the image.
  • Figure 2: Overview of PRISM2 training data. (a) The training data can be described in terms of patients, cases, specimens, blocks, and slides, as shown. Clinical reports are paired at the level of specimens. Consequently, PRISM2 is trained at the specimen level to predict diagnostic findings in the clinical reports. (b) The distribution of tissues represented in the training data. (c) The distribution of specimens and WSIs between MSK samples and those submitted for second opinion from diverse external sources around the world. (d) Distribution of high-level conditions in the training data. (e) Overview of how dialogue examples are generated from clinical reports to be used during PRISM2 training. Every processing step (arrow) uses GPT-4o. Clinical report rewrites are split into clinical history and pathology diagnosis. Diagnostic summaries are produced without demographic, molecular, and sectioning information. Diagnostic findings are first listed, and then converted into yes-no and open-ended question-answer pairs. As the yes-no question distribution is biased toward the findings mentioned in clinical reports, we mine additional complementary yes-no question-answer pairs by assuming that findings which are not mentioned are negative in the specimen. Multiple-choice QA is derived directly from the reports, following the CAP reporting guidelines.
  • Figure 2: Examples of PRISM2 dialogues. (a) A breast specimen with IDC. We show WSI in the initial message for illustrative purposes, but for each example PRISM2 views all the slides of the specimen at once. (b) A bladder biopsy. Despite being trained on simple questions, the model is able to respond accurately to compound questions. (c) A uterus specimen. The model correctly identifies and characterizes the carcinoma and additional findings, except that it incorrectly predicts FIGO Grade 1 instead of Grade 2. (d) A liver specimen from a patient with metastatic colorectal cancer. The model can be given additional context, such as patient history and a specimen description. It detects the carcinoma and correctly predicts its origin and grade, but incorrectly states that there are no treatment-related changes.
  • Figure 3: Direct prediction through dialogue with PRISM2. (a) Yes-no QA matches or exceeds the performance of existing clinical-grade products for the detection of invasive cancer, namely Paige Prostate, Paige Breast, and Paige Breast Lymph Node (BLN), when evaluated on the corresponding product testing datasets. Yes-no QA on pan-cancer and rare variants exceeds the contrastive performance of competing approaches. After training a linear classifier using the diagnostic embedding as input on a large pan-cancer dataset, PRISM2 pan-cancer performance improves, and continues to outperform competing approaches with similar adaptation. (b) The size of the pan-cancer training set used for linear adaptation is similar to the training sets of the three clinical models; however, prostate (light blue), breast (dark blue), and BLN (yellow) tissues are a relatively small subset of the pan-cancer dataset. (c) Direct prediction performs well on TCGA cancer subtyping with multiple-choice QA. n indicates the number of samples; $+$ and $-$ indicate the number of positive and negative samples, respectively. Error bars show the 95% confidence interval computed with bootstrapping. * indicates a direct prediction result tied for first place (p < 0.05, permutation test).
  • ...and 11 more figures
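
The captions above state that yes-no QA yields quantifiable predictions with probabilities and that error bars are 95% confidence intervals computed with bootstrapping. The following is a minimal sketch of how such quantities could be computed, not the authors' implementation. It assumes the probability is a two-way softmax over the LLM's "yes" and "no" answer-token logits, that the evaluated metric is AUROC, and that the interval is a percentile bootstrap. The function names (`yes_probability`, `auroc`, `bootstrap_auroc_ci`), the metric, and the resampling count are all illustrative assumptions.

```python
import numpy as np


def yes_probability(logit_yes, logit_no):
    """P(yes) from a two-way softmax over the 'yes'/'no' token logits (assumed scheme)."""
    diff = np.asarray(logit_yes, dtype=float) - np.asarray(logit_no, dtype=float)
    # softmax over two logits reduces to a sigmoid of their difference
    return 1.0 / (1.0 + np.exp(-diff))


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)


def bootstrap_auroc_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Point estimate plus percentile-bootstrap confidence interval for AUROC."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    point = auroc(labels, scores)  # also validates that both classes are present
    rng = np.random.default_rng(seed)
    n = labels.size
    stats = []
    while len(stats) < n_resamples:
        idx = rng.integers(0, n, size=n)  # resample cases with replacement
        if labels[idx].all() or not labels[idx].any():
            continue  # skip resamples that contain only one class
        stats.append(auroc(labels[idx], scores[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return point, float(lo), float(hi)


if __name__ == "__main__":
    # Hypothetical logits for six specimens; labels mark cancer-positive cases.
    logit_yes = [3.1, 2.0, 0.4, -0.5, -1.2, -2.3]
    logit_no = [0.2, 0.5, 0.1, 0.6, 1.0, 1.5]
    labels = [1, 1, 1, 0, 0, 0]
    probs = yes_probability(logit_yes, logit_no)
    print(bootstrap_auroc_ci(labels, probs, n_resamples=200))
```

Because the probability depends only on the difference between the two logits, it can be thresholded or passed to a separate calibration step without rerunning the model.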