Benchmarking Domain Generalization Algorithms in Computational Pathology

Neda Zamanitajeddin, Mostafa Jahanifar, Kesi Xu, Fouzia Siraj, Nasir Rajpoot

TL;DR

This study tackles domain shift in computational pathology (CPath) by systematically benchmarking 30 domain generalization (DG) algorithms on three tasks (CAMELYON17, MIDOG22, HISTOPANTUM) using the unified HistoDomainBed platform and 7,560 cross-validation runs. It finds that self-supervised learning and stain augmentation consistently deliver strong generalization, while simple empirical risk minimization (ERM) baselines remain competitive when properly designed; stain normalization also performs well in several settings. The work introduces HISTOPANTUM, a pan-cancer tumor-detection dataset, and provides practical DG guidelines tailored to CPath, emphasizing fine-tuning of pretrained models and modality-specific augmentations. Together, these contributions offer a reproducible benchmark and actionable guidance for making deep learning (DL) models more robust to domain shift in histopathology, with relevance for deploying DG methods in clinical pathology pipelines and for integrating foundation models into CPath.

Abstract

Deep learning models have shown immense promise in computational pathology (CPath) tasks, but their performance often suffers when applied to unseen data due to domain shifts. Addressing this requires domain generalization (DG) algorithms. However, a systematic evaluation of DG algorithms in the CPath context is lacking. This study aims to benchmark the effectiveness of 30 DG algorithms on 3 CPath tasks of varying difficulty through 7,560 cross-validation runs. We evaluate these algorithms using a unified and robust platform, incorporating modality-specific techniques and recent advances like pretrained foundation models. Our extensive cross-validation experiments provide insights into the relative performance of various DG strategies. We observe that self-supervised learning and stain augmentation consistently outperform other methods, highlighting the potential of pretrained models and data augmentation. Furthermore, we introduce a new pan-cancer tumor detection dataset (HISTOPANTUM) as a benchmark for future research. This study offers valuable guidance to researchers in selecting appropriate DG approaches for CPath tasks.
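
The abstract does not spell out how the cross-validation splits are built. As an illustration only, the sketch below shows leave-one-domain-out splitting, a protocol commonly used in DG benchmarks: each domain is held out once for testing while the remaining domains are used for training. The function name `leave_one_domain_out`, the domain names, and the patch IDs are hypothetical and are not taken from the paper or from HistoDomainBed.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_domain_out(
    samples_by_domain: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_domain, train_samples, test_samples) once per domain.

    Assumption: every sample belongs to exactly one domain, so the train
    and test sets of a split never overlap.
    """
    for held_out in sorted(samples_by_domain):
        # Training pool: all samples from every domain except the held-out one.
        train = [
            sample
            for domain, samples in sorted(samples_by_domain.items())
            if domain != held_out
            for sample in samples
        ]
        # Test pool: only the samples of the unseen (held-out) domain.
        test = list(samples_by_domain[held_out])
        yield held_out, train, test


# Hypothetical toy data: placeholder domain names and patch IDs.
toy = {
    "center_A": ["a1", "a2"],
    "center_B": ["b1"],
    "center_C": ["c1", "c2", "c3"],
}
splits = list(leave_one_domain_out(toy))
```

Each split trains on the remaining domains and reports performance on the unseen one, which is the setting in which domain shift is expected to hurt a model that has overfit to its training domains.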

Paper Structure (25 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Domain shift (DS) in computational pathology can degrade performance when a model is tested on an unseen dataset (A). Different types of DS are illustrated in (B), with shapes as classes, colors as features, and each circle as a domain: covariate shift is shown by changing the color of objects between two domains; prior shift occurs when the distribution of classes differs between the two domains; posterior shift is shown when the same objects are labeled differently by observers (highlighted shapes); and in class-conditional shift, the color of only one class changes between domains. Using three different tasks (C), we benchmark the performance of 30 domain generalization algorithms (D) in a series of robust cross-validation experiments (E).
  • Figure 2: Tasks and datasets used in the benchmarking process: (A) breast cancer metastasis detection using the CAMELYON17 dataset [bandi2019cpath], (B) mitosis detection in the MIDOG22 dataset [aubreville2024domain], and (C) tumor detection in our proposed HISTOPANTUM dataset. For every dataset, an example from each domain and class is provided. All tasks are framed as binary classification, where the name and population of the positive and negative classes are shown in red and blue color bars, respectively. The hatched region in each bar represents the fraction of samples used to generate the small datasets (see the sub-dataset section).
  • Figure 3: Benchmarked algorithms categorized into different domain generalization methodologies, as introduced in [jahanifar2023domain].
  • Figure 4: Benchmarking results for different algorithms: (A) Accuracy and (B) F1 score. Each domain is represented by a unique color, and the average performance of all algorithms over each domain is shown as a horizontal dashed line in the same color.
  • Figure 5: Comparative analysis of algorithm performance in small vs. full dataset regimes, showcasing Accuracy (A) and F1 score (B) metrics. The plots illustrate how each algorithm's effectiveness varies with dataset size. Algorithms close to the bottom-left corner are desirable.
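
The four shift types named in the Figure 1 caption can be written compactly in a common textbook formalization. This is a sketch rather than the paper's own notation: the symbols $P_S$ and $P_T$ (source and target domain distributions), $x$ (image features), $y$ (class label), and $c$ (a single class) are assumptions introduced here for illustration.

```latex
\begin{align*}
\text{Covariate shift:} \quad & P_S(x) \neq P_T(x), \quad P_S(y \mid x) = P_T(y \mid x) \\
\text{Prior shift:} \quad & P_S(y) \neq P_T(y), \quad P_S(x \mid y) = P_T(x \mid y) \\
\text{Posterior shift:} \quad & P_S(y \mid x) \neq P_T(y \mid x) \\
\text{Class-conditional shift:} \quad & P_S(x \mid y = c) \neq P_T(x \mid y = c) \ \text{for some class } c
\end{align*}
```

In the caption's visual vocabulary, a change in $P(x)$ corresponds to objects changing color, a change in $P(y)$ to a different mix of shapes, a change in $P(y \mid x)$ to observers labeling the same object differently, and a change in $P(x \mid y = c)$ to a single shape changing color.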