Towards Large-Scale Training of Pathology Foundation Models

kaiko.ai; Nanne Aben; Edwin D. de Jong; Ioannis Gatopoulos; Nicolas Känzig; Mikhail Karasikov; Axel Lagré; Roman Moser; Joost van Doorn; Fei Tang

TL;DR

This work presents a scalable pipeline for large-scale pathology foundation models using Online Patching to dynamically sample patches from WSIs and a standardized evaluation framework (eva) for fair cross-model comparisons. Through experiments on TCGA with DINO and DINOv2, the authors show that pretraining on ImageNet speeds convergence, and that training with multiple magnifications improves robustness, with data diversity being crucial for out-of-distribution generalization. They introduce unsupervised metrics (RankMe and ODCorr) that correlate with downstream performance and release both the models and eva for broader adoption. Overall, the approach enables scalable, reproducible development and evaluation of pathology FMs across diverse downstream tasks.
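The two unsupervised metrics named above can be illustrated with a minimal NumPy sketch. RankMe is computed here in its commonly used form, the exponential of the entropy of the normalized singular values of an embedding matrix. The off-diagonal correlation function is an assumption: this summary does not define ODCorr, so the sketch uses the mean absolute off-diagonal entry of the feature correlation matrix, which may differ from the authors' definition. The function names are illustrative and not taken from the released code.

```python
import numpy as np


def rankme(embeddings: np.ndarray, eps: float = 1e-7) -> float:
    """Effective rank: exp of the Shannon entropy of the normalized singular values."""
    singular_values = np.linalg.svd(embeddings, compute_uv=False)
    p = singular_values / singular_values.sum() + eps
    return float(np.exp(-np.sum(p * np.log(p))))


def off_diagonal_correlation(embeddings: np.ndarray) -> float:
    """ASSUMPTION: mean absolute off-diagonal entry of the feature correlation matrix."""
    corr = np.corrcoef(embeddings, rowvar=False)  # features are columns
    dim = corr.shape[0]
    off_diagonal = corr[~np.eye(dim, dtype=bool)]
    return float(np.mean(np.abs(off_diagonal)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for patch embeddings: 512 samples, 16 features.
    z = rng.normal(size=(512, 16))
    print("RankMe:", rankme(z))
    print("Off-diagonal correlation:", off_diagonal_correlation(z))
```

Under these definitions, embeddings that spread variance evenly over uncorrelated dimensions have a RankMe close to the embedding dimension and an off-diagonal correlation close to zero, while collapsed embeddings show the opposite pattern.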

Abstract

Driven by recent advances in deep learning methods and, in particular, by the development of modern self-supervised learning algorithms, increased interest and effort have been devoted to building foundation models (FMs) for medical images. In this work, we present our scalable training pipeline for large pathology imaging data, and a comprehensive analysis of various hyperparameter choices and training techniques for building pathology FMs. We release and make publicly available the first batch of our pathology FMs (https://github.com/kaiko-ai/towards_large_pathology_fms) trained on open-access TCGA whole slide images, a commonly used collection of pathology images. The experimental evaluation shows that our models reach state-of-the-art performance on various patch-level downstream tasks, ranging from breast cancer subtyping to colorectal nuclear segmentation. Finally, to unify the evaluation approaches used in the field and to simplify future comparisons of different FMs, we present an open-source framework (https://github.com/kaiko-ai/eva) designed for the consistent evaluation of pathology FMs across various downstream tasks.
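The summary describes the pipeline as sampling patches from whole slide images on the fly rather than from a fixed, pre-extracted patch set. A minimal sketch of that idea follows, under strong assumptions: slides are plain in-memory NumPy arrays standing in for real WSIs, there is a single magnification level, and no tissue or foreground filtering is applied. The function `sample_patches` and its parameters are hypothetical and are not the authors' implementation.

```python
import numpy as np


def sample_patches(slides, patch_size, batch_size, rng):
    """Draw a batch of patches at random coordinates from randomly chosen slides."""
    batch = []
    for _ in range(batch_size):
        # Pick a slide uniformly at random (hypothetical policy).
        slide = slides[rng.integers(len(slides))]
        height, width = slide.shape[:2]
        # Pick a top-left corner so the patch lies fully inside the slide.
        top = rng.integers(0, height - patch_size + 1)
        left = rng.integers(0, width - patch_size + 1)
        batch.append(slide[top:top + patch_size, left:left + patch_size])
    return np.stack(batch)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for WSIs: small random RGB arrays of different sizes.
    slides = [
        rng.integers(0, 256, size=(1024, 768, 3), dtype=np.uint8),
        rng.integers(0, 256, size=(900, 1200, 3), dtype=np.uint8),
    ]
    patches = sample_patches(slides, patch_size=224, batch_size=8, rng=rng)
    print(patches.shape)
```

Because coordinates are redrawn on every call, the model rarely sees the same patch twice, which corresponds to the 'inf' setting described in the Figure 2 caption.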

Paper Structure (24 sections, 1 equation, 6 figures, 8 tables)

Introduction
Results and discussion
Training state-of-the-art FMs with online patching
Starting from FMs pre-trained on ImageNet yields faster convergence
Training FMs at multiple magnifications improves robustness
The effect of training data size
The number of training WSIs
The number of distinct patches
Conclusion
Methods
Data
Online high-throughput loading of patches from WSIs
Pretraining setup
Evaluation setup
Unsupervised metrics
...and 9 more sections

Figures (6)

Figure 1: Validation performance over the course of training a ViT-S16 with DINO, initialized with random weights (blue) and from a model pre-trained on ImageNet (orange). Left: Linear probing performance on BACH. Center: Linear probing performance on PCam. Right: Linear probing performance on TP53.
Figure 2: Validation performance of ViT-S16 over 100 epochs of DINO training on random 1%, 10%, 30%, and 100% subsets of the TCGA WSIs (left) and for different numbers of distinct training patches sampled at random coordinates from random WSIs of 100% TCGA (right). 'inf' denotes the run in which all training patches are distinct and sampled from random coordinates. Performance is measured with linear probing on the PCam/val downstream task.
Figure 3: Qualitative results for the semantic segmentation task. Left: the patch. Center: the semantic segmentation labels. Right: the predictions with our ViT-B8 model.
Figure 4: Validation performance over the course of training a ViT-S16 model using DINOv2 (orange) and DINO (blue). Left: Off-diagonal correlation on randomly selected TCGA patches. Center: Linear probing performance on the test split of the BACH dataset. Right: Linear probing performance on the validation split of the PCam dataset. The orange curves for DINOv2 show the range over 4 runs with different learning rates, while the blue curves show a single DINO run using the standard setting. Details can be found in the evaluation appendix.
Figure 5: Correlation between ODCorr and top-1 accuracy of the representation on CIFAR-10 (left), CIFAR-100 (center) and Food101 (right). An inverse correlation between the ODCorr and top-1 accuracy can be observed.
...and 1 more figure
