
Phikon-v2, A large and public feature extractor for biomarker prediction

Alexandre Filiot, Paul Jacob, Alice Mac Kain, Charlie Saillard

TL;DR

The paper addresses the need for large-scale, publicly accessible histopathology encoders for biomarker prediction. It trains Phikon-v2, a ViT-L model, with DINOv2 on 460M tiles from the PANCAN-XL dataset and benchmarks it against 14 encoders across eight external downstream tasks using MIL-based aggregation and ensembling. Key findings show that models of ViT-L size or larger, especially Phikon-v2 and GigaPath, perform strongly across most tasks, though specialized models can excel on specific biomarkers; ensembling provides substantial, consistent gains (+1.75 AUC). The work emphasizes open data and models, while acknowledging that scaling has limits and advocating further organ- and task-specific fine-tuning for clinical deployment.

Abstract

Gathering histopathology slides from over 100 publicly available cohorts, we compile a diverse dataset of 460 million pathology tiles covering more than 30 cancer sites. Using this dataset, we train a large self-supervised vision transformer with DINOv2 and publicly release one iteration of this model, coined Phikon-v2, for further experimentation. Although trained only on publicly available histology slides, Phikon-v2 surpasses our previously released model (Phikon) and performs on par with other histopathology foundation models (FMs) trained on proprietary data. Our benchmarks include eight slide-level tasks, with results reported on external validation cohorts to avoid any data contamination between pre-training and evaluation datasets. Our downstream training procedure follows a simple yet robust ensembling strategy, yielding a +1.75 AUC increase across tasks and models compared to one-shot retraining (p<0.001). We compare Phikon (ViT-B) and Phikon-v2 (ViT-L) against 14 different histology feature extractors, making our evaluation the most comprehensive to date. Our results provide evidence that DINOv2 handles joint model and data scaling better than iBOT. We also show that recent scaling efforts are overall beneficial to downstream performance in the context of biomarker prediction, with GigaPath and H-Optimus-0 (two ViT-g models with 1.1B parameters each) standing out. However, the statistical margins between the latest top-performing FMs remain mostly non-significant; some even underperform on specific indications or tasks such as MSI prediction, where they are outperformed by a 13x smaller model developed internally. While the latest foundation models may exhibit limitations for clinical deployment, they nonetheless offer excellent grounds for the development of more specialized and cost-efficient histology encoders fueling AI-guided diagnostic tools.

Paper Structure

This paper contains 15 sections, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Distribution of tissue sites in the PANCAN-XL pre-training dataset. See Extended Table \ref{table:pancanxl_details} for details.
  • Figure 2: Comparison of ensembling performance against one-shot retraining. For each task, we compute the median AUC and 95% confidence interval across 10,000 repeats, based on the ensembling of the predictions of the 25 models obtained from cross-validation (i.e., statistics over a single distribution of predictions, "Ensembling", blue). We compare this metric to the average (and standard deviation) of the 25 AUCs taken from each individual model without ensembling (i.e., statistics over a distribution of 25 AUCs, "Average", red). Finally, we compare it to the one-shot retraining strategy, which consists of training a single model on the whole training dataset for a number of epochs optimized during cross-validation ("Retraining", black dots). Reading note: "(+1.0)" indicates that ensembling yields a +1.0 AUC gain over retraining, averaged across all tasks.
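
The difference between the "Ensembling" and "Average" statistics in Figure 2 can be illustrated with a minimal Python sketch. It is not the authors' code, and it rests on several assumptions: the ensemble is taken as the plain mean of the per-model predictions, the "repeats" are implemented as bootstrap resamples of the evaluation set, and the interval is a simple percentile interval. All function names (`auc`, `ensemble_scores`, `bootstrap_auc`, `average_auc`) are hypothetical, and the data are synthetic.

```python
import random
import statistics


def auc(labels, scores):
    """ROC AUC via pairwise comparison (Mann-Whitney U); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_scores(per_model_scores):
    """Assumed ensembling rule: mean prediction per sample across all models."""
    return [statistics.fmean(column) for column in zip(*per_model_scores)]


def bootstrap_auc(labels, scores, n_repeats=10_000, seed=0):
    """Median AUC and 95% percentile interval over bootstrap resamples (assumption)."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_repeats:
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[i] for i in idx]
        if len(set(resampled_labels)) < 2:
            continue  # a resample with a single class has no defined AUC
        aucs.append(auc(resampled_labels, [scores[i] for i in idx]))
    aucs.sort()
    low = aucs[int(0.025 * (n_repeats - 1))]
    high = aucs[int(0.975 * (n_repeats - 1))]
    return statistics.median(aucs), (low, high)


def average_auc(labels, per_model_scores):
    """Mean and standard deviation of the individual models' AUCs ("Average")."""
    aucs = [auc(labels, scores) for scores in per_model_scores]
    return statistics.fmean(aucs), statistics.stdev(aucs)


# Synthetic demo: 25 noisy models scoring 40 samples (not real data).
rng = random.Random(42)
labels = [i % 2 for i in range(40)]
per_model_scores = [
    [y + rng.gauss(0.0, 1.0) for y in labels]  # signal plus model-specific noise
    for _ in range(25)
]

ensembled = ensemble_scores(per_model_scores)
median_auc, (ci_low, ci_high) = bootstrap_auc(labels, ensembled, n_repeats=200)
mean_auc, std_auc = average_auc(labels, per_model_scores)
print(f"Ensembling: median AUC {median_auc:.3f} (95% CI {ci_low:.3f}-{ci_high:.3f})")
print(f"Average:    mean AUC {mean_auc:.3f} +/- {std_auc:.3f}")
```

The one-shot "Retraining" baseline is not reproduced here, since it requires training a model rather than post-processing existing predictions. The demo uses 200 repeats only to keep the run short; the caption reports 10,000.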