The Importance of Downstream Networks in Digital Pathology Foundation Models

Gustav Bredell, Marcel Fischer, Przemyslaw Szostak, Samaneh Abbasi-Sureshjani, Alvaro Gomariz

TL;DR

This work addresses a bias in the evaluation of digital pathology foundation models that arises when the downstream aggregation configuration is fixed. It introduces a framework that evaluates seven feature extractors on three datasets under a grid of 162 aggregation configurations, for a total of 3,402 experiments (7 × 3 × 162), to quantify each extractor's sensitivity to the aggregation step. The key finding is that the aggregation configuration substantially influences performance and that no single configuration benefits all extractors. Once this sensitivity is accounted for, many feature extractors perform comparably, with DP-trained BYOL variants and cross-dataset generalization playing notable roles. The study argues for fairer, configuration-aware evaluation of downstream components and calls for extending this approach beyond classification tasks to better gauge real-world utility.

Abstract

Digital pathology has significantly advanced disease detection and pathologist efficiency through the analysis of gigapixel whole-slide images (WSIs). In this process, WSIs are first divided into patches, a feature extractor model is applied to each patch to obtain feature vectors, and an aggregation model then processes these vectors to predict the WSI label. With the rapid evolution of representation learning, numerous new feature extractor models, often termed foundation models, have emerged. Traditional evaluation methods rely on a static downstream aggregation model setup with a fixed architecture and fixed hyperparameters, a practice we identify as potentially biasing the results. Our study uncovers a sensitivity of feature extractor models to aggregation model configurations, indicating that performance comparisons can be skewed by the chosen configuration. By accounting for this sensitivity, we find that the performance of many current feature extractor models is notably similar. We support this insight by evaluating seven feature extractor models across three different datasets with 162 different aggregation model configurations. This comprehensive approach provides a more nuanced understanding of the feature extractors' sensitivity to aggregation model configurations, leading to a fairer and more accurate assessment of new foundation models in digital pathology.
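
The pipeline described in the abstract (patch features in, one slide-level prediction out) can be illustrated with a minimal sketch. The aggregator below uses softmax attention pooling, which is only one common choice for multiple instance learning and is an assumption here; the abstract does not name the architectures used. The function name, the weight vectors, and the random features standing in for a feature extractor's output are all hypothetical.

```python
import numpy as np

def attention_mil_predict(patch_features, w_attn, w_clf):
    """Aggregate (n_patches, d) patch features into one slide-level probability.

    Assumed attention-pooling aggregator, for illustration only.
    """
    scores = patch_features @ w_attn                 # one attention score per patch
    scores = scores - scores.max()                   # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over patches
    slide_embedding = weights @ patch_features       # weighted mean, shape (d,)
    logit = slide_embedding @ w_clf                  # linear slide-level classifier
    prob = 1.0 / (1.0 + np.exp(-logit))              # sigmoid for a binary label
    return prob, weights

# Hypothetical stand-in for the output of a feature extractor on 500 patches.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 16))
w_attn = rng.normal(size=16)
w_clf = rng.normal(size=16)

prob, weights = attention_mil_predict(features, w_attn, w_clf)
```

In a real evaluation the weights would be trained per feature extractor and dataset, and the aggregator's architecture and hyperparameters are exactly the choices whose influence the paper studies.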

Paper Structure (10 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Typical frameworks for evaluation of feature extraction models use fixed configurations in the aggregation models, leading to substantially different results and hence limited informative value.
  • Figure 2: Illustration of the typical classification pipeline with MIL in digital pathology.
  • Figure 3: The heatmap shows the performance of every aggregation model configuration set for each feature extraction model. The red colored legend shows how the configurations are ordered on the heatmap.
  • Figure 4: Comparison of 7 feature extraction models across 162 different aggregation model configurations, comprising 2 architectures with 81 hyperparameter configurations each.
  • Figure S1: The heatmap shows the performance of every aggregation model configuration for each feature extraction model according to the AP metric. The lighter the color, the higher the AP. The heatmaps are shown for all three datasets. The red-colored legend shows how the aggregation model configurations are ordered on the heatmap. For example, the lowest learning rate is used for the first third of the configurations for each feature aggregator, followed by the next higher learning rate.
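
The counts in the captions above (2 architectures × 81 configurations = 162, and 162 × 7 extractors × 3 datasets = 3,402 runs) and the ordering described for Figure S1 can be reproduced with a small grid enumeration. This is a sketch under stated assumptions: the factorisation of 81 into four hyperparameters with three values each is a guess, and every architecture name, hyperparameter name, and value below is hypothetical. Only the learning rate is mentioned in the captions.

```python
from itertools import product

# Hypothetical names and values; only the counts come from the captions.
ARCHITECTURES = ["aggregator_a", "aggregator_b"]
LEARNING_RATES = [1e-5, 1e-4, 1e-3]   # ascending, lowest first
HIDDEN_DIMS = [128, 256, 512]         # assumed hyperparameter
DROPOUTS = [0.0, 0.25, 0.5]           # assumed hyperparameter
WEIGHT_DECAYS = [0.0, 1e-4, 1e-2]     # assumed hyperparameter

def build_grid():
    """Enumerate configurations with architecture outermost, then learning rate."""
    keys = ("architecture", "learning_rate", "hidden_dim", "dropout", "weight_decay")
    values = (ARCHITECTURES, LEARNING_RATES, HIDDEN_DIMS, DROPOUTS, WEIGHT_DECAYS)
    return [dict(zip(keys, combo)) for combo in product(*values)]

grid = build_grid()

n_extractors, n_datasets = 7, 3
total_runs = len(grid) * n_extractors * n_datasets
```

Because `itertools.product` varies the last argument fastest, the first third of each architecture's 81 configurations shares the lowest learning rate, matching the ordering the Figure S1 caption describes.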