Evaluating Computational Pathology Foundation Models for Prostate Cancer Grading under Distribution Shifts

Fredrik K. Gustafsson, Mattias Rantalainen

TL;DR

This work evaluates two computational pathology foundation models, UNI (trained on more than 100,000 whole-slide images) and CONCH (trained on more than 1.1 million image-caption pairs), by utilizing them as feature extractors within prostate cancer grading models.

Abstract

Foundation models have recently become a popular research direction within computational pathology. They are intended to be general-purpose feature extractors, promising to achieve good performance on a range of downstream tasks. Real-world pathology image data does, however, exhibit considerable variability. Foundation models should be robust to these variations and to other distribution shifts that might be encountered in practice. We evaluate two computational pathology foundation models, UNI (trained on more than 100,000 whole-slide images) and CONCH (trained on more than 1.1 million image-caption pairs), by utilizing them as feature extractors within prostate cancer grading models. We find that while UNI and CONCH perform well relative to baselines, the absolute performance can still be far from satisfactory in certain settings. The fact that foundation models have been trained on large and varied datasets does not guarantee that downstream models will always be robust to common distribution shifts.

Paper Structure

This paper contains 7 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Performance comparison of UNI, CONCH and Resnet-IN across different PANDA subsets, when utilized as patch-level feature extractors in the ABMIL (top), Mean Feature (middle) or kNN (bottom) ISUP grade classification models. All results are mean$\pm$std (standard deviation) over $10$ random cross-validation folds. Raw numerical results for this figure are provided in Table \ref{tab:main_results_uni_conch_resnet-in} in the supplementary material. The same model performance comparison but in terms of MAE instead of kappa is also found in Figure \ref{fig:main_results_uni_conch_resnet-in_mae}.
  • Figure 2: Performance comparison of the ISUP grade models ABMIL, Mean Feature and kNN, when utilizing UNI (top), CONCH (middle) or Resnet-IN (bottom) as patch-level feature extractors. This figure contains the same results as Figure \ref{fig:main_results_uni_conch_resnet-in_kappa}, but presented to enable a direct comparison of the ISUP grade models. All results are mean$\pm$std over $10$ random cross-validation folds. Raw numerical results for this figure are provided in Table \ref{tab:main_results_attmil_mean-pool_knn}. The same model performance comparison but in terms of MAE instead of kappa is also found in Figure \ref{fig:main_results_attmil_mean-pool_knn_mae}.
  • Figure 3: Top: Detailed performance comparison of UNI, CONCH and Resnet-IN, when utilized as patch-level feature extractors in the ABMIL ISUP grade model. Bottom: Detailed performance comparison of the three ISUP grade models ABMIL, Mean Feature and kNN, when utilizing UNI as the patch-level feature extractor. All results are mean$\pm$std over $10$ random cross-validation folds.
  • Figure 4: We study robustness in terms of two common types of distribution shifts: (a): Shifts in the WSI image data (visualization of $2500$ randomly sampled patches from Radboud and Karolinska). (b) & (c): Shifts in the label distribution over the ISUP grades $0 - 5$.
  • Figure 5: Overview of the three evaluated ISUP grade classification models. Top: ABMIL. Middle: Mean Feature. Bottom: kNN. All three models utilize the same initial WSI processing steps. First, the input prostate biopsy WSI $x$ is tissue-segmented and divided into non-overlapping patches $\tilde{x}_i$ of size $256 \times 256$ using CLAM \cite{lu2021data}. Next, a feature vector $p(\tilde{x}_i)$ is extracted for each patch, using a pretrained and frozen feature extractor (either UNI \cite{chen2024uni}, CONCH \cite{conch2024} or Resnet-IN). The different models then process these patch-level feature vectors $p(\tilde{x}_i)$ further (see the Methods section for details), finally outputting a predicted ISUP grade $\hat{y}(x) \in \{0, \dots, 5\}$. In all three figures, blue marks the pretrained and frozen patch-level feature extractor, whereas green marks trainable model components.
  • ...and 2 more figures
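
The Figure 5 caption describes three ways of turning patch-level feature vectors $p(\tilde{x}_i)$ into a slide-level ISUP grade. The following is a minimal sketch of the aggregation step only, not the authors' implementation. It makes several assumptions that this listing does not confirm: the attention scores come from a single hypothetical linear vector `w` (ABMIL models typically learn a small network for this instead), and the kNN model is assumed to work on mean slide-level features with Euclidean distance and majority voting. All function names are hypothetical.

```python
import math
from collections import Counter


def mean_feature(patch_features):
    """Average the patch-level feature vectors into one slide-level vector."""
    n = len(patch_features)
    dim = len(patch_features[0])
    return [sum(p[d] for p in patch_features) / n for d in range(dim)]


def attention_pool(patch_features, w):
    """Attention-weighted average of patch features (simplified ABMIL-style pooling).

    `w` is a hypothetical attention vector; in a trained model the scores
    would come from learned parameters rather than being supplied by hand.
    """
    # One scalar score per patch: dot product between w and the patch feature.
    scores = [sum(wi * pi for wi, pi in zip(w, p)) for p in patch_features]
    # Numerically stable softmax over the patches.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(patch_features[0])
    pooled = [sum(a * p[d] for a, p in zip(weights, patch_features))
              for d in range(dim)]
    return pooled, weights


def knn_predict(query, train_vectors, train_grades, k=3):
    """Predict a grade by majority vote among the k nearest training slides.

    Assumes Euclidean distance on slide-level vectors; ties are broken
    in favour of the lowest grade (an arbitrary choice for this sketch).
    """
    nearest = sorted(
        (math.dist(query, v), g) for v, g in zip(train_vectors, train_grades)
    )[:k]
    votes = Counter(g for _, g in nearest)
    best = max(votes.values())
    return min(g for g, c in votes.items() if c == best)
```

With an all-zero `w`, every patch receives the same attention weight, so `attention_pool` reduces to `mean_feature`. This is one way to see the Mean Feature model as a special case of attention-based pooling with nothing learned in the aggregation step.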