Training state-of-the-art pathology foundation models with orders of magnitude less data

Mikhail Karasikov, Joost van Doorn, Nicolas Känzig, Melis Erdal Cesur, Hugo Mark Horlings, Robert Berke, Fei Tang, Sebastian Otálora

TL;DR

This work challenges the assumption that pathology foundation models (PFMs) require massive unlabeled image collections. By modifying the DINOv2 self-supervised learning pipeline (including KDE regularization, HSV filtering, and HED augmentations) and applying a novel high-resolution post-training step, the authors train three PFMs on only 12k to 92k WSIs and achieve competitive or superior performance on multiple downstream tasks. The Midnight series (Midnight-12k, Midnight-92k, Midnight-92k/392) matches or exceeds state-of-the-art models such as Virchow2 and UNI-2, with notable gains in segmentation and a strong showing on PCam 10-shot. Ablation studies reveal the contribution of each modification, while qualitative segmentation and slide-level tasks demonstrate practical benefits for clinical histopathology, though some tasks still require further optimization. Overall, the results indicate substantial potential to advance pathology FMs with much less data than previously thought, enabling broader research and potential clinical impact.

Abstract

The field of computational pathology has recently seen rapid advances driven by the development of modern vision foundation models (FMs), typically trained on vast collections of pathology images. Recent studies demonstrate that increasing the training data set and model size and integrating domain-specific image processing techniques can significantly enhance the model's performance on downstream tasks. Building on these insights, our work incorporates several recent modifications to the standard DINOv2 framework from the literature to optimize the training of pathology FMs. We also apply a post-training procedure for fine-tuning models on higher-resolution images to further enrich the information encoded in the embeddings. We present three novel pathology FMs trained on up to two orders of magnitude fewer WSIs than those used to train other state-of-the-art FMs while demonstrating a comparable or superior performance on downstream tasks. Even the model trained on TCGA alone (12k WSIs) outperforms most existing FMs and, on average, matches Virchow2, the second-best FM published to date. This suggests that there still remains a significant potential for further improving the models and algorithms used to train pathology FMs to take full advantage of the vast data collections.

Paper Structure

This paper contains 17 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Schematic representation of the FM evaluation. Panel A: Evaluation of a vision transformer FM at standard and high resolution. For high resolution, larger tiles of size 392$\times$392 are split into $(392/14)^2=784$ patches of the same size, 14$\times$14 pixels. (The grids are simplified for readability: the actual numbers of patches per tile are $256$ and $784$, not the $4^2$ and $7^2$ drawn in the figure.) Panel B: Aggregating token embeddings produced by the ViT into the final CLS+Mean token embedding.
  • Figure 2: Examples of segmentation performed with different FMs on two tiles from the CoNSeP data set: ViT-g14 (natural images), Lunit, Virchow2, and our models Midnight-12k and Midnight-92k/392. Ground truth is shown on the left side (green: inflammatory, blue: epithelial, yellow: spindle-shaped nuclei).
  • Figure 3: Tile predictions for Lunit, Virchow2, and Midnight-12k and ground truth annotations for slide test_040 from Camelyon16.
  • Figure A1: Left: Random tiles cropped from TCGA FFPE slides at 0.5µm/px that passed the HSV filter and those filtered out. Right: Random HED augmentations applied to a single tile sampled from the BACH data set.
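
The patch arithmetic and the token aggregation described in the Figure 1 caption can be illustrated with a minimal Python sketch. It rests on two assumptions that the caption does not state outright: the standard-resolution tile side of 224 pixels is inferred from the quoted 256 patches of 14$\times$14 pixels, and "CLS+Mean" is taken to mean concatenating the CLS embedding with the mean of the patch embeddings. The function names `num_patches` and `cls_plus_mean` are hypothetical and are not taken from the authors' code.

```python
from statistics import fmean

PATCH_SIZE = 14  # ViT patch side in pixels, as given in the Figure 1 caption


def num_patches(tile_size: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of non-overlapping square patches a square tile is split into."""
    if tile_size % patch_size != 0:
        raise ValueError("tile size must be a multiple of the patch size")
    return (tile_size // patch_size) ** 2


def cls_plus_mean(cls_token: list[float],
                  patch_tokens: list[list[float]]) -> list[float]:
    """Aggregate ViT token embeddings into one tile embedding.

    Assumption: "CLS+Mean" is the concatenation of the CLS embedding and
    the per-dimension mean of the patch embeddings, so the result has
    twice the token dimension.
    """
    dim = len(cls_token)
    mean_token = [fmean(tok[d] for tok in patch_tokens) for d in range(dim)]
    return list(cls_token) + mean_token


# 224 px is inferred from the 256-patch count; 392 px is stated in the caption.
print(num_patches(224))  # 256
print(num_patches(392))  # 784

# Toy 2-dimensional tokens, only to show the output shape.
print(cls_plus_mean([1.0, 2.0], [[0.0, 2.0], [2.0, 4.0]]))  # [1.0, 2.0, 1.0, 3.0]
```

Under the concatenation assumption, the aggregated embedding length stays at twice the token dimension whether the tile yields 256 or 784 patches, so the same downstream head can consume embeddings from both resolutions.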