Beyond accuracy: quantifying the reliability of Multiple Instance Learning for Whole Slide Image classification

Hassan Keshvarikhojasteh, Marc Aubreville, Christof A. Bertram, Josien P. W. Pluim, Mitko Veta

TL;DR

This paper introduces three quantitative metrics for reliability assessment and applies them to several widely used MIL architectures across three region-wise annotated pathology datasets, indicating that the mean pooling instance (MEAN-POOL-INS) model demonstrates superior reliability compared to other networks.

Abstract

Machine learning models have become integral to many fields, but their reliability, defined as producing dependable, trustworthy, and domain-consistent predictions, remains a critical concern. Multiple Instance Learning (MIL) models designed for Whole Slide Image (WSI) classification in computational pathology are rarely evaluated in terms of reliability, leaving a key gap in understanding their suitability for high-stakes applications like clinical decision-making. In this paper, we address this gap by introducing three quantitative metrics for reliability assessment and applying them to several widely used MIL architectures across three region-wise annotated pathology datasets. Our findings indicate that the mean pooling instance (MEAN-POOL-INS) model demonstrates superior reliability compared to other networks, despite its simple architectural design and computational efficiency. These findings underscore the need for reliability evaluation alongside predictive performance in MIL models and establish MEAN-POOL-INS as a strong, trustworthy baseline for future research.

Paper Structure (18 sections, 10 figures, 9 tables)

This paper contains 18 sections, 10 figures, 9 tables.

Figures (10)

  • Figure 1: The overall framework for evaluating the reliability of MIL models follows a three-step process. First, a MIL model is trained on a weakly-supervised task for predicting slide-level labels. Next, the trained model is applied to predict scores for individual image patches. Finally, the reliability value is computed based on the predicted patch scores and their corresponding annotations, where tumor patches are highlighted in green and normal patches in orange in the annotation visualization.
  • Figure 2: (I) The test-30 slide with ground-truth annotations (green) overlaid on the tissue section. (II) Corresponding heatmap generated by MAX-POOL, showing the distribution of predicted patch scores from low (blue) to high (red). The annotation and heatmap are spatially aligned for comparison.
  • Figure 3: (I) The test-40 slide with ground-truth annotations (green) overlaid on the tissue section. (II) Corresponding heatmap generated by MEAN-POOL-INS, showing the distribution of predicted patch scores from low (blue) to high (red). The annotation and heatmap are spatially aligned for comparison.
  • Figure 4: (I) A slide from CATCH with ground-truth annotations (green) overlaid on the tissue section. (II) Corresponding heatmap generated by MAX-POOL, showing the distribution of predicted patch scores from low (blue) to high (red). The annotation and heatmap are spatially aligned for comparison.
  • Figure 5: (I) A slide from CATCH with ground-truth annotations (green) overlaid on the tissue section. (II) Corresponding heatmap generated by ACMIL/4, showing the distribution of predicted patch scores from low (blue) to high (red). The annotation and heatmap are spatially aligned for comparison.
  • ...and 5 more figures
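
The last two steps of the framework described in the Figure 1 caption (score individual patches, then compare the scores with the region annotations) can be illustrated with a small sketch. This is not the authors' implementation, and the training step is omitted. The function names are hypothetical, the slide score is assumed to be the plain average of per-patch scores (one reading of "mean pooling instance"), and patch-level AUROC is used only as a stand-in, since the paper's three reliability metrics are not specified in this excerpt.

```python
from statistics import fmean


def slide_score_mean_pool(patch_scores):
    """Aggregate per-patch scores into a slide-level score by averaging.

    Assumption: MEAN-POOL-INS is taken here to mean instance-level
    mean pooling; the paper's exact formulation is not given above.
    """
    return fmean(patch_scores)


def patch_auroc(patch_scores, patch_labels):
    """Stand-in agreement measure between patch scores and annotations.

    Computes AUROC as the fraction of (tumor, normal) patch pairs in
    which the tumor patch receives the higher score; ties count half.
    This is NOT one of the paper's metrics, only an illustration.
    """
    tumor = [s for s, y in zip(patch_scores, patch_labels) if y == 1]
    normal = [s for s, y in zip(patch_scores, patch_labels) if y == 0]
    if not tumor or not normal:
        raise ValueError("need at least one tumor and one normal patch")
    wins = sum((t > n) + 0.5 * (t == n) for t in tumor for n in normal)
    return wins / (len(tumor) * len(normal))


# Hypothetical patch scores from a trained model, and their annotations
# (1 = tumor patch, 0 = normal patch).
scores = [0.9, 0.8, 0.3, 0.1, 0.2]
labels = [1, 1, 0, 0, 0]

print(slide_score_mean_pool(scores))
print(patch_auroc(scores, labels))
```

In this toy example every tumor patch outscores every normal patch, so the stand-in measure is at its maximum. A model can reach the correct slide-level label while ranking patches poorly, which is the kind of gap a reliability evaluation is meant to expose.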