Detecting Domain Shift in Multiple Instance Learning for Digital Pathology Using Fréchet Domain Distance

Milda Pocevičiūtė, Gabriel Eilertsen, Stina Garvin, Claes Lundström

TL;DR

The paper addresses domain shift in multiple-instance learning (MIL) applied to digital pathology and introduces Fréchet Domain Distance (FDD), an unsupervised metric that quantifies shift between datasets. Using an attention-based MIL model (CLAM) and several kinds of MIL features, the authors define FDD_K, which compares aggregated feature statistics across datasets, and show that FDD_64 with positive evidence best tracks performance degradation (mean change in Matthews correlation coefficient, MCC) across clinically realistic shifts between Camelyon and BRLN data. FDD_K outperforms uncertainty-based and representation-based baselines, which suggests it could serve as a practical QA tool for deploying MIL systems at new sites without additional pathologist annotations.
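To make the idea of comparing aggregated feature statistics concrete, here is a minimal sketch of a Fréchet distance between Gaussians fitted to two feature sets, in the squared form familiar from the Fréchet Inception Distance. It is an illustration under assumptions, not the authors' implementation: the function name `frechet_distance` and the input layout (one row per extracted patch feature) are hypothetical, and the exact FDD_K definition should be taken from the paper.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, n_dims); hypothetical layout.
    """
    feats_a = np.asarray(feats_a, dtype=float)
    feats_b = np.asarray(feats_b, dtype=float)
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs (clipped here against round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Usage with synthetic features: a pure mean shift leaves the covariance
# terms at (approximately) zero, so only the mean term contributes.
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 4))
target = source + 3.0
print(frechet_distance(source, source), frechet_distance(source, target))
```

Computing the trace term from eigenvalues avoids an explicit matrix square root and keeps the sketch dependent on NumPy only.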

Abstract

Multiple-instance learning (MIL) is an attractive approach for digital pathology applications as it reduces the costs related to data collection and labelling. However, it is not clear how sensitive MIL is to clinically realistic domain shifts, i.e., differences in data distribution that could negatively affect performance, and if already existing metrics for detecting domain shifts work well with these algorithms. We trained an attention-based MIL algorithm to classify whether a whole-slide image of a lymph node contains breast tumour metastases. The algorithm was evaluated on data from a hospital in a different country and various subsets of this data that correspond to different levels of domain shift. Our contributions include showing that MIL for digital pathology is affected by clinically realistic differences in data, evaluating which features from a MIL model are most suitable for detecting changes in performance, and proposing an unsupervised metric named Fréchet Domain Distance (FDD) for quantification of domain shifts. Shift measure performance was evaluated through the mean Pearson correlation to change in classification performance, where FDD achieved 0.70 on 10-fold cross-validation models. The baselines included Deep ensemble, Difference of Confidence, and Representation shift which resulted in 0.45, -0.29, and 0.56 mean Pearson correlation, respectively. FDD could be a valuable tool for care providers and vendors who need to verify if a MIL system is likely to perform reliably when implemented at a new site, without requiring any additional annotations from pathologists.
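The evaluation protocol described in the abstract, a mean Pearson correlation between a shift measure and the change in classification performance over cross-validation models, can be sketched as follows. The function name `mean_pearson` and the array layout (one row per model, one column per evaluated data subset) are assumptions made for illustration, and the numbers in the usage lines are synthetic, not results from the paper.

```python
import numpy as np

def mean_pearson(shift_scores, perf_drops):
    """Mean Pearson correlation between shift scores and performance drops.

    Both inputs have shape (n_models, n_subsets); hypothetical layout.
    Returns the mean over models and the per-model correlations.
    """
    shift_scores = np.asarray(shift_scores, dtype=float)
    perf_drops = np.asarray(perf_drops, dtype=float)
    per_model = [
        float(np.corrcoef(s, p)[0, 1])  # off-diagonal entry is Pearson's r
        for s, p in zip(shift_scores, perf_drops)
    ]
    return float(np.mean(per_model)), per_model

# Synthetic usage: two models, four data subsets each.
shift = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
drops = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
mean_r, per_model_r = mean_pearson(shift, drops)
print(mean_r, per_model_r)
```

A useful shift measure yields correlations close to 1 for every model: larger measured shift then goes together with a larger drop in performance.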

Paper Structure (11 sections, 2 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Classification performance reported as mean (standard deviation) of the MCC and ROC-AUC metrics, computed over the 10-fold CV models. The decision threshold for MCC is determined on validation data.
  • Figure 2: Box plots of Pearson correlations achieved by Fréchet distance and Representation shift metric using attention-based features, i.e., positive, negative, combined evidence, and randomly selected features. Varying number of extracted patch representations $K$ is considered: from 1 to 128. The reported results are over the 10 cross validation models.
  • Figure 3: Visualisation of an attention MIL framework for digital pathology. The features that are used in our experiments are marked in red. FC stands for Fully Connected.
  • Figure 4: Concatenated attention-based features: Pearson correlations achieved by Fréchet distance and Representation shift metric using positive, negative, combined evidence, and randomly selected features. Varying number of extracted patch representations $K$ is considered: from 1 to 128. The reported results are over the 10 CV models.
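
The captions above refer to extracting $K$ patch representations per slide as positive, negative, combined, or randomly selected evidence. A minimal sketch of such a selection is given below, under the assumption that positive evidence means the $K$ patches with the highest attention scores, negative evidence the $K$ lowest, and combined evidence both sets together. That reading, the function name `select_patch_features`, and its parameters are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def select_patch_features(features, attention, k, mode="positive", seed=0):
    """Pick k patch feature vectors from one slide by attention score.

    features: (n_patches, n_dims); attention: (n_patches,). Hypothetical API.
    """
    features = np.asarray(features)
    attention = np.asarray(attention)
    order = np.argsort(attention)      # patch indices, ascending attention
    top = order[::-1][:k]              # k highest-attention patches
    bottom = order[:k]                 # k lowest-attention patches
    if mode == "positive":
        idx = top
    elif mode == "negative":
        idx = bottom
    elif mode == "combined":
        idx = np.concatenate([top, bottom])
    elif mode == "random":
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(attention), size=k, replace=False)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return features[idx]

# Usage: five patches with two-dimensional features.
feats = np.arange(10).reshape(5, 2)
attn = np.array([0.10, 0.40, 0.05, 0.30, 0.15])
print(select_patch_features(feats, attn, k=2, mode="positive"))
```

Features selected this way from every slide of a dataset could then be pooled and compared across datasets with a distance between their summary statistics.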