Virchow: A Million-Slide Digital Pathology Foundation Model

Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Siqi Liu, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, Philippe Mathieu, Alexander van Eck, Donghun Lee, Julian Viret, Eric Robert, Yi Kan Wang, Jeremy D. Kunz, Matthew C. H. Lee, Jan Bernhard, Ran A. Godrich, Gerard Oakley, Ewan Millar, Matthew Hanna, Juan Retamero, William A. Moye, Razik Yousfi, Christopher Kanan, David Klimstra, Brandon Rothrock, Thomas J. Fuchs

TL;DR

Virchow is a pathology foundation model trained with DINOv2 on 1.5 million H&E-stained whole slide images (WSIs), producing tile embeddings that support a broad range of diagnostic tasks. The study reports state-of-the-art pan-cancer detection (specimen-level AUROC 0.949 across 17 cancer types) and rare-cancer detection (AUROC 0.937 across 7 rare cancer types), along with strong results on slide-level biomarker prediction and tile-level benchmarks. The results highlight the value of massive pathology datasets and domain-specific self-supervised pretraining, while acknowledging limitations such as single-center training data and the need for a separate aggregator model to obtain slide-level predictions. The work suggests that scaling data and architecture can meaningfully improve downstream computational pathology tasks, with implications for clinical decision support and biomarker discovery.

Abstract

The use of artificial intelligence to enable precision medicine and decision support systems through the analysis of pathology images has the potential to revolutionize the diagnosis and treatment of cancer. Such applications will depend on models' abilities to capture the diverse patterns observed in pathology images. To address this challenge, we present Virchow, a foundation model for computational pathology. Using self-supervised learning empowered by the DINOv2 algorithm, Virchow is a vision transformer model with 632 million parameters trained on 1.5 million hematoxylin and eosin stained whole slide images from diverse tissue and specimen types, which is orders of magnitude more data than previous works. The Virchow model enables the development of a pan-cancer detection system with 0.949 overall specimen-level AUC across 17 different cancer types, while also achieving 0.937 AUC on 7 rare cancer types. The Virchow model sets the state-of-the-art on the internal and external image tile level benchmarks and slide level biomarker prediction tasks. The gains in performance highlight the importance of training on massive pathology image datasets, suggesting scaling up the data and network architecture can improve the accuracy for many high-impact computational pathology applications where limited amounts of training data are available.

Paper Structure (23 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Overview of the training dataset (a-d), training algorithm (e), and application (f) of Virchow, a foundation model for digital pathology. a. The training data can be described in terms of patients, cases, specimens, blocks or slides as shown. (b-d) The slide distribution as a function of cancer status (b), surgery (c), and tissue type (d). e. The dataflow during training which requires processing the slide into tiles, which are then cropped into global and local views. f. Schematic of applications of the foundation model using an aggregator model to predict attributes at the slide level.
  • Figure 2: Pan-cancer detection results (a-c). Detection is specimen-level, produced with an aggregator network trained on Virchow, Phikon, or CTransPath tile embeddings. a. Cancer prediction performance (AUROC) stratified by cancer type as determined by origin tissue ("H&N" is head and neck). The incidence rate of each cancer is shown. Virchow embeddings enable the best cancer detection performance across all cancer types and performance remains robust on rare cancers. For each cancer type, the AUROC corresponding to the statistically significantly (p < 0.05) top performing embeddings is highlighted in magenta. When more than one AUROC is not gray, performance is "tied" (no statistically significant difference). b. ROC curves showing the overall pan-cancer detection performance, as well as performance stratified across internal MSKCC data vs. data coming from diverse external institutions. All evaluation data is withheld from training. c. Sensitivity at 95% specificity for rare cancer detection (* p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001). d. Half of the specimens come from diverse external institutions (OOD data). e. ID vs. OOD tissues in the evaluation dataset. Some of the OOD tissues arise from cancer metastases.
  • Figure 3: A summary of tile-level linear probing. a. The number of tasks in which each model scored in the top-x. b. A description of each task. c. The weighted F1 score for each of the six models and six tasks. d. Virchow discovers cells in the CoNSeP dataset: malignant epithelium (red), miscellaneous (yellow), and inflammatory (magenta) cells.
  • Figure 11: Distributions of cancer and benign tiles in the PanMSK dataset. The splits are balanced such that each tissue group approximately follows the same 7:1:2 (training:validation:testing) ratio in both tile and slide counts.
  • Figure 12: Schematic of the DINOv2 training routine. From a single tile, 2 global crops and 8 local crops are created, all with random augmentations. The global crops are randomly masked and fed to the student model, while the unmasked versions are fed to the teacher model. The student tries to produce a global representation of each view (via the CLS token) that matches the teacher's representation of the opposite view. The student also tries to produce representations of the masked image tokens that match the teacher's representations of the same tokens unmasked. The local crops are fed only to the student, which tries to produce representations that match the teacher's representations of the global crops. The teacher is an exponential moving average (EMA) copy of the student.
  • ...and 1 more figure
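
The training routine described in the Figure 12 caption can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the 224-pixel tile size, the crop area ranges, the momentum value, and all function names are assumptions introduced here. Only the 2 global and 8 local crops, the student/teacher pairing, and the EMA teacher come from the caption. Augmentations, masking of global crops, and the loss itself are omitted.

```python
import math
import random

def random_crop_box(tile_size, area_range, rng):
    """Return (x, y, side) for a square crop covering a random fraction of the tile area."""
    area_frac = rng.uniform(*area_range)
    side = max(1, int(round(tile_size * math.sqrt(area_frac))))
    x = rng.randint(0, tile_size - side)
    y = rng.randint(0, tile_size - side)
    return (x, y, side)

def make_views(tile_size, rng, n_global=2, n_local=8):
    """Create crop boxes for the global and local views of one tile.

    The area ranges below are assumed values, not taken from the paper.
    """
    global_views = [random_crop_box(tile_size, (0.32, 1.0), rng) for _ in range(n_global)]
    local_views = [random_crop_box(tile_size, (0.05, 0.32), rng) for _ in range(n_local)]
    return global_views, local_views

def view_pairs(n_global=2, n_local=8):
    """List (teacher_view, student_view) index pairs used for matching.

    The teacher only sees global views (indices 0..n_global-1). The student
    sees global views followed by local views. A global view is never
    matched against itself, only against the opposite global view.
    """
    pairs = []
    for t in range(n_global):
        for s in range(n_global + n_local):
            if s != t:
                pairs.append((t, s))
    return pairs

def ema_update(teacher, student, momentum):
    """Move each teacher weight toward the student weight.

    The teacher receives no gradient updates; it only tracks the student.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

if __name__ == "__main__":
    rng = random.Random(0)
    global_views, local_views = make_views(224, rng)  # 224 is an assumed tile size
    print(len(global_views), "global crops,", len(local_views), "local crops")
    print(len(view_pairs()), "teacher/student pairs per tile")
    # 0.996 is an assumed momentum, chosen only for illustration
    print(ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.996))
```

With 2 global and 8 local crops, each tile yields 18 teacher/student pairs: 2 from matching each global crop against the opposite global crop, and 16 from matching every local crop against both global crops.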