Current Pathology Foundation Models are unrobust to Medical Center Differences
Edwin D. de Jong, Eric Marcus, Jonas Teuwen
TL;DR
This paper tackles center-induced bias in pathology foundation models (FMs) by introducing the Robustness Index, a neighborhood-based metric that quantifies whether embeddings are organized more by biological signals or by medical-center signatures. Using ten public pathology FMs and a patch-level TCGA-2k dataset, it shows pervasive center-driven structure in most models; only Virchow2 achieves a robustness index above 1, indicating biology-dominant organization. The study combines a simple k-NN evaluation, center-prediction controls, and t-SNE visualizations to demonstrate how center information undermines cancer-type prediction and generalization to unseen centers. The authors argue for explicit measurement and methodological mitigation to enable safe clinical deployment of pathology FMs, and provide the Robustness Index as a practical tool to guide robust model development and evaluation.
Abstract
Pathology Foundation Models (FMs) hold great promise for healthcare. Before they can be used in clinical practice, it is essential to ensure they are robust to variations between medical centers. We measure whether pathology FMs focus on biological features like tissue and cancer type, or on the well-known confounding medical center signatures introduced by staining procedures and other differences. We introduce the Robustness Index, a novel metric that reflects to what degree biological features dominate confounding features. Ten current publicly available pathology FMs are evaluated. We find that all of the evaluated models represent the medical center to a strong degree, and that they differ significantly in their robustness index. Only one model so far has a robustness index greater than one, meaning biological features dominate confounding features, but only slightly. We describe a quantitative approach to measure the influence of medical center differences on FM-based prediction performance. We analyze the impact of this lack of robustness on the classification performance of downstream models, and find that cancer-type classification errors are not random, but specifically attributable to same-center confounders: images of other classes from the same medical center. We visualize FM embedding spaces, and find these are more strongly organized by medical centers than by biological factors. As a consequence, the medical center of origin is predicted more accurately than the tissue source and cancer type. The robustness index introduced here is provided with the aim of advancing progress towards clinical adoption of robust and reliable pathology FMs.
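The text above describes the Robustness Index only qualitatively, as a neighborhood-based measure that exceeds one when biological features dominate medical-center features. The following is a minimal sketch of one way such a ratio could be computed, not the authors' definition. It assumes that, among each sample's k nearest neighbors, "same biological class, other center" neighbors count as evidence for biology and "other class, same center" neighbors count as evidence for center confounding. The function name `robustness_index_sketch`, the Euclidean distance, and the default `k` are all illustrative assumptions.

```python
import numpy as np

def robustness_index_sketch(embeddings, bio_labels, center_labels, k=5):
    """Hypothetical neighborhood ratio (an assumption, not the paper's formula).

    Counts, over the k nearest neighbors of every sample:
      SO = neighbors with the Same biological class but an Other center
      OS = neighbors with an Other biological class but the Same center
    and returns SO / OS. Values above 1 suggest biology-dominant structure.
    """
    X = np.asarray(embeddings, dtype=float)
    bio = np.asarray(bio_labels)
    center = np.asarray(center_labels)

    # Pairwise squared Euclidean distances (assumed metric).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor

    # Indices of the k nearest neighbors of each sample.
    nn = np.argsort(dist, axis=1)[:, :k]

    same_bio = bio[nn] == bio[:, None]
    same_center = center[nn] == center[:, None]

    so = np.sum(same_bio & ~same_center)
    os_ = np.sum(~same_bio & same_center)
    return so / os_ if os_ > 0 else float("inf")

# Toy embeddings: the first axis separates two groups strongly,
# the second axis separates two groups weakly.
toy_embeddings = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 1.0], [0.1, 1.0],
    [10.0, 0.0], [10.1, 0.0], [10.0, 1.0], [10.1, 1.0],
])
strong_axis = np.array([0, 0, 0, 0, 1, 1, 1, 1])
weak_axis = np.array([0, 0, 1, 1, 0, 0, 1, 1])

# Biology on the strong axis: neighborhoods are biology-dominant.
r_bio_dominant = robustness_index_sketch(toy_embeddings, strong_axis, weak_axis)
# Center on the strong axis: neighborhoods are center-dominant.
r_center_dominant = robustness_index_sketch(toy_embeddings, weak_axis, strong_axis)
```

In this toy setup the same embeddings give a ratio above one when the strongly separated axis carries the biological label and a ratio below one when it carries the center label, which matches the interpretation given in the abstract.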
