Predictive uncertainty estimation in deep learning for lung carcinoma classification in digital pathology under real dataset shifts

Abdur R. Fayjie, Jutika Borah, Florencia Carbone, Jan Tack, Patrick Vandewalle

TL;DR

This work addresses the problem of unreliable predictions from deep learning (DL) models in digital pathology under real-world distribution shifts. It presents a large-scale benchmark comparing MC-dropout, deep ensembles, and few-shot learning (FSL) for lung carcinoma classification, using entropy as the uncertainty measure on in-domain test data, in-distribution shifts, and out-of-distribution (OOD) data. The study finds that in-domain performance is high but that shifts, especially changes in organ origin and imaging modality, degrade accuracy; ensembles and FSL nonetheless generally provide better calibration and uncertainty estimates than MC-dropout or the baseline. The results suggest that uncertainty-aware DL can support safer clinical decision-making in digital pathology, though practical deployment will require addressing computational cost and further calibration, especially for novel abnormal cases.
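To make the entropy-based uncertainty measure concrete, the following is a minimal sketch of how predictive entropy could be computed from several stochastic forward passes (MC-dropout) or several ensemble members. The function names and the softmax values are hypothetical and chosen for illustration only; they are not taken from the paper, and the authors' exact formulation may differ.

```python
import math
from typing import List, Sequence


def mean_probabilities(member_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average class probabilities over T dropout passes or M ensemble members."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical distribution."""
    # Zero-probability classes contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical softmax outputs for one image from three members (two classes).
confident = [[0.98, 0.02], [0.97, 0.03], [0.99, 0.01]]    # members agree
disagreeing = [[0.90, 0.10], [0.15, 0.85], [0.55, 0.45]]  # members disagree

h_confident = predictive_entropy(mean_probabilities(confident))
h_disagreeing = predictive_entropy(mean_probabilities(disagreeing))

# For two classes, entropy is bounded above by log(2).
max_entropy = math.log(2)
```

In this sketch, the input on which the members disagree yields the higher entropy, which is the kind of signal that could be used to flag a prediction for review.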

Abstract

Deep learning has shown tremendous progress in a wide range of digital pathology and medical image classification tasks. Integrating it into safe clinical decision support requires robust and reliable models. However, real-world data is diverse and often lies outside the intended source distribution, and when test samples differ dramatically from the training data, clinical decision-making is greatly affected. Quantifying predictive uncertainty is crucial for well-calibrated predictions and for determining when to trust a model and when not to. Unfortunately, many works have overlooked the importance of predictive uncertainty estimation. This paper evaluates whether predictive uncertainty estimation adds robustness to deep learning-based diagnostic decision-making systems. We investigate the effect of various carcinoma distribution shift scenarios on predictive performance and calibration. First, we systematically investigate three popular methods for improving predictive uncertainty (Monte Carlo dropout, deep ensembles, and few-shot learning) on lung adenocarcinoma classification, the primary disease, in whole slide images. Second, we compare the effectiveness of the methods in terms of performance and calibration under clinically relevant distribution shifts: in-distribution shifts comprising primary disease sub-types and other characterization analysis data, and out-of-distribution shifts comprising well-differentiated cases, different organ origin, and imaging modality shifts. While studies on uncertainty estimation exist, to the best of our knowledge no rigorous large-scale benchmark compares predictive uncertainty estimation methods under these dataset shifts for lung carcinoma classification.
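The abstract refers to calibration without naming a metric in this section. As an assumption, the sketch below uses the expected calibration error (ECE), a commonly used summary of how far a model's confidence is from its accuracy. The function name, bin count, and example values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE with equal-width confidence bins; 0.0 means perfectly calibrated."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bin k covers [k/n_bins, (k+1)/n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of the samples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical example: an overconfident model (95% confident, 50% correct).
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
)
# Hypothetical example: confidence matches accuracy (75% confident, 75% correct).
calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False]
)
```

Here the overconfident example has a gap of about 0.45 between confidence and accuracy, while the second example has a gap of about zero. Under dataset shift, a rising ECE would indicate that confidence scores are becoming less trustworthy.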

Paper Structure

This paper contains 22 sections, 4 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Example images from each dataset, which contribute different data distributions. From the top left: WSI LC25000: (a) Lung Adenocarcinoma, (b) Normal, (c) SCC, (d) Normal SCC, (e) Colon Adenocarcinoma, (f) Colon Normal; WSI BMRIDS: (g) Acinar, (h) Lepidic, (i) Solid, (j) Micropapillary, (k) Papillary; CPTAC-LUAD: (l) LUAD-positive, (m) LUAD-negative; Pneumonia CXRs: (n) Pneumonia-positive, (o) Normal.
  • Figure 2: The experimental set-up features various distribution shifts, from histopathology analysis to more specific characterization such as proteomic analysis, for classification of lung carcinoma and its sub-types. The training distribution, $D_{in, train}$, contains samples from the LC25000 dataset. The internal test distribution, $D_{in, test}$, is drawn from the same distribution as the training data. The unseen test distributions comprise in-distribution shifts, $D_{test, ext}$, and OOD shifts, $D_{ood, test}$. The in-distribution shifts comprise two datasets that differ in geographical origin and characterization, $D_{ext, prot}$, and in class distribution, $D_{ext, 5ad}$. The OOD shifts consist of datasets with a morphologically different carcinoma sub-type, $D_{ood, scc}$; a different organ origin, $D_{ood, cad}$; and an imaging modality completely different from that of the training samples, $D_{ood, cxr}$.