A protocol for evaluating robustness to H&E staining variation in computational pathology models

Lydia A. Schönpflug, Nikki van den Berg, Sonali Andani, Nanda Horeweg, Jurriaan Barkey Wolf, Tjalling Bosse, Viktor H. Koelzer, Maxime W. Lafarge

Abstract

Sensitivity to staining variation remains a major barrier to deploying computational pathology (CPath) models: hematoxylin and eosin (H&E) staining varies across laboratories, so its effect on model predictions must be assessed systematically. In this work, we developed a three-step protocol for evaluating robustness to H&E staining variation in CPath models: (1) select reference staining conditions, (2) characterize test set staining properties, and (3) apply CPath model(s) under simulated reference staining conditions. To support the protocol, we first created a new reference staining library based on the PLISM dataset. As an exemplary use case, we applied the protocol to assess the robustness of 306 microsatellite instability (MSI) classification models on the unseen SurGen colorectal cancer dataset (n=738): 300 attention-based multiple instance learning models trained on the TCGA-COAD/READ datasets across three feature extractors (UNI2-h, H-Optimus-1, Virchow2), and six public MSI classification models. Classification performance was measured as AUC, and robustness as the min-max AUC range across four simulated staining conditions (low/high H&E intensity, low/high H&E color similarity), so a smaller range indicates a more robust model. Across models and staining conditions, classification performance ranged from AUC 0.769 to 0.911 ($Δ$ = 0.142). Robustness ranged from 0.007 to 0.079 ($Δ$ = 0.072) and showed a weak inverse correlation with classification performance (Pearson r=-0.22, 95% CI [-0.34, -0.11]). These results show that the proposed evaluation protocol enables robustness-informed CPath model selection and provides insight into performance shifts across H&E staining conditions, supporting the identification of operational ranges for reliable model deployment. Code is available at https://github.com/CTPLab/staining-robustness-evaluation.
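The robustness measure described in the abstract (the min-max AUC range across staining conditions) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `auc` and `robustness_range`, the condition keys, and the toy labels and scores are all hypothetical, and the AUC is computed here with a simple pairwise rule rather than any particular library.

```python
from typing import Dict, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def robustness_range(
    labels: Sequence[int],
    scores_by_condition: Dict[str, Sequence[float]],
) -> float:
    """Min-max AUC range across conditions (smaller means more robust)."""
    aucs = [auc(labels, scores) for scores in scores_by_condition.values()]
    return max(aucs) - min(aucs)


# Toy example (hypothetical data): 1 = MSI, 0 = non-MSI, with one score
# vector per simulated staining condition for the same four slides.
labels = [1, 1, 0, 0]
scores_by_condition = {
    "low_intensity": [0.9, 0.8, 0.2, 0.1],
    "high_intensity": [0.9, 0.4, 0.5, 0.1],
    "low_color_similarity": [0.7, 0.6, 0.3, 0.2],
    "high_color_similarity": [0.6, 0.3, 0.3, 0.2],
}

per_condition_auc = {
    name: auc(labels, scores) for name, scores in scores_by_condition.items()
}
print(per_condition_auc)
print("robustness (min-max AUC range):", robustness_range(labels, scores_by_condition))
```

Reporting the range alongside the per-condition AUCs keeps the two quantities separate: a model can have a high AUC under one condition and still show a wide range across the others.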

Paper Structure (23 sections, 2 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: A protocol for evaluating robustness to staining variation of CPath models. a) Select reference staining conditions from the reference library created from the PLISM dataset. b) Characterize SurGen test set staining properties. c) MSI classification models: n=300 models, reflecting plausible state-of-the-art models, were trained on TCGA-COAD/READ using one of three foundation models as feature extractors (UNI2-h, H-Optimus-1, Virchow2) and ABMIL for aggregation; n=6 publicly available models were also considered. d) Run inference with the MSI classification models under four simulated reference staining conditions. e) Evaluate results by measuring model performance and robustness to staining variation.
  • Figure 2: Staining characteristics of PLISM staining condition-device combinations. a) Intensity of Hematoxylin and Eosin, b) Angle between H&E stain vectors in OD space, c) Distribution of H&E hues, measured as hue h° in CIELab space; left violin: Hematoxylin, right violin: Eosin. Marker colors correspond to RGB stain colors. The selected reference conditions (low and high intensity; low and high H&E color similarity) are circled in red and green respectively and highlighted with a black frame. For staining condition and device abbreviations, please refer to Appendix Tables A.1 and A.2.
  • Figure 3: Staining characteristics of SurGen WSIs. a) Intensity of Hematoxylin and Eosin, b) Angle between H&E stain vectors in OD space, c) Distribution of H&E hues, measured as hue h° in CIELab space; left violin: Hematoxylin, right violin: Eosin. Marker colors correspond to RGB stain colors; low and high color similarity PLISM references are circled in green and red respectively.
  • Figure 4: Performance-robustness relationship across MSI classification models. a) Performance (AUC under the reference condition) versus robustness (min-max AUC range across all stain conditions) for all evaluated models (n=306). We report the Pearson correlation between performance and robustness with 95% CIs. b) Top models (Performance > 0.90, Robustness < 0.03; n=10). Dots indicate bootstrapped means (n=1000 iterations); ellipses represent 95% CIs for both performance and robustness.
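
Panel b of Figures 2 and 3 reports the angle between the H&E stain vectors in optical density (OD) space. A minimal sketch of how such an angle could be computed is given below. It assumes the standard Beer-Lambert conversion OD = -log10(I / I0) with a white background of 255. The function names `rgb_to_od` and `stain_angle_deg` and the example RGB stain colors are hypothetical, and the stain vector estimation used in the paper is not reproduced here.

```python
import math
from typing import List, Sequence


def rgb_to_od(rgb: Sequence[float], background: float = 255.0) -> List[float]:
    """Convert an RGB color to optical density, OD = -log10(I / I0).

    Channel values are clipped to at least 1 to avoid log(0).
    """
    return [-math.log10(max(c, 1.0) / background) for c in rgb]


def stain_angle_deg(h_vec: Sequence[float], e_vec: Sequence[float]) -> float:
    """Angle in degrees between two stain vectors (e.g. H and E in OD space)."""
    dot = sum(a * b for a, b in zip(h_vec, e_vec))
    norm = math.sqrt(sum(a * a for a in h_vec)) * math.sqrt(sum(b * b for b in e_vec))
    if norm == 0.0:
        raise ValueError("stain vectors must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


# Hypothetical RGB stain colors, chosen only for illustration.
hematoxylin_rgb = (70, 50, 130)
eosin_rgb = (230, 120, 170)

h_od = rgb_to_od(hematoxylin_rgb)
e_od = rgb_to_od(eosin_rgb)
print("H&E angle in OD space (degrees):", stain_angle_deg(h_od, e_od))
```

Under this reading, a smaller angle means the two stain colors are more alike in OD space, which is one way to interpret the high color similarity condition.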