Do Histopathological Foundation Models Eliminate Batch Effects? A Comparative Study

Jonah Kömen, Hannah Marienwald, Jonas Dippel, Julius Hense

TL;DR

It is empirically shown that the feature embeddings of histopathological foundation models still contain distinct hospital signatures that can lead to biased predictions and misclassifications; this finding paves the way for more robust pretraining strategies and downstream predictors.

Abstract

Deep learning has led to remarkable advancements in computational histopathology, e.g., in diagnostics, biomarker prediction, and outcome prognosis. Yet, the lack of annotated data and the impact of batch effects, e.g., systematic technical differences between data from different hospitals, hamper model robustness and generalization. Recent histopathological foundation models -- pretrained on millions to billions of images -- have been reported to improve generalization performance on various downstream tasks. However, it has not been systematically assessed whether they fully eliminate batch effects. In this study, we empirically show that the feature embeddings of the foundation models still contain distinct hospital signatures that can lead to biased predictions and misclassifications. We further find that the signatures are not removed by stain normalization methods, dominate distances in feature space, and are evident across various principal components. Our work provides a novel perspective on the evaluation of medical foundation models, paving the way for more robust pretraining strategies and downstream predictors.

Paper Structure

This paper contains 17 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Accuracy scores for predicting the tissue source site (TSS) of a patch from the feature embeddings of different histopathological foundation models. We considered features of raw and stain-normalized (Reinhard, Macenko) patches from two datasets: TCGA-LUSC-5 (top) with 5 and CAMELYON16 (bottom) with 2 tissue source sites. The most transparent bars represent the nearest centroid classifier, the medium-transparent bars k-nearest neighbors, and the non-transparent bars linear probing models. Higher accuracy indicates stronger site signatures. The experimental details are described in Section \ref{sec:source-site-prediction}.
  • Figure 2: Arbitrarily chosen patches from the TCGA-LUSC-5 dataset (left) and the CAMELYON16 dataset (right). Differences in staining are apparent, e.g., see TSS 66, TSS 0, or TSS 1.
  • Figure 3: Cancer classification accuracy in % using linear probing (LP) on CAMELYON16 for different training data compositions ($x/y$). The test data composition remains 0/1 across splits. We assessed the performance for features of unnormalized, Reinhard-normalized, and Macenko-normalized patches, and report the mean and standard deviation over 5 repetitions.
  • Figure 4: Ordered Euclidean distances between the feature of a randomly drawn cancerous reference patch and the features of ss (bottom line), ossh (middle line), and osoh (top line) on CAMELYON16. Because all distances lie on the same line (see all), small offsets were added for clearer visualization such that ss, ossh, and osoh become distinguishable.
  • Figure 5: Accuracy on the site prediction task described in Section \ref{sec:source-site-prediction} on TCGA-LUSC-5, using k-nearest neighbors (KNN) on features projected onto the first $\ell$ principal components (PCs) (dots) and on non-reduced features (line).
  • ...and 2 more figures
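
The site-prediction probe summarized in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings here are synthetic Gaussian vectors with a hypothetical per-site offset (`site_shift`) standing in for real foundation-model features, only two sites are simulated, and only the nearest centroid classifier is shown. All function names are assumptions made for illustration.

```python
import random


def make_embeddings(n_per_site, dim, site_shift, seed=0):
    """Synthetic stand-in for patch embeddings (assumption, not real data).

    Each of two sites gets Gaussian features; site 1 is offset by
    `site_shift` in every dimension to mimic a site signature.
    """
    rng = random.Random(seed)
    features, sites = [], []
    for site in (0, 1):
        for _ in range(n_per_site):
            features.append([rng.gauss(site * site_shift, 1.0) for _ in range(dim)])
            sites.append(site)
    return features, sites


def fit_centroids(features, sites):
    """Return the mean feature vector of each site."""
    sums, counts = {}, {}
    for x, s in zip(features, sites):
        if s not in sums:
            sums[s] = [0.0] * len(x)
            counts[s] = 0
        sums[s] = [a + b for a, b in zip(sums[s], x)]
        counts[s] += 1
    return {s: [v / counts[s] for v in total] for s, total in sums.items()}


def predict_site(centroids, x):
    """Assign x to the site whose centroid is closest in Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda s: sq_dist(centroids[s]))


def site_accuracy(train, test):
    """Fit centroids on the training split, report accuracy on the test split."""
    centroids = fit_centroids(*train)
    feats, sites = test
    hits = sum(predict_site(centroids, x) == s for x, s in zip(feats, sites))
    return hits / len(sites)


if __name__ == "__main__":
    train = make_embeddings(200, 16, site_shift=1.0, seed=0)
    test = make_embeddings(100, 16, site_shift=1.0, seed=1)
    print("site prediction accuracy:", site_accuracy(train, test))
```

In this toy setting, accuracy well above chance means the site can be read off the features, which is the sense in which the caption says higher accuracy indicates stronger site signatures. Setting `site_shift=0.0` removes the simulated signature, and accuracy should then fall to roughly chance level.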