Predictive uncertainty estimation in deep learning for lung carcinoma classification in digital pathology under real dataset shifts
Abdur R. Fayjie, Jutika Borah, Florencia Carbone, Jan Tack, Patrick Vandewalle
TL;DR
This work addresses the problem of unreliable predictions from deep learning models in digital pathology under real-world distribution shifts. It conducts a large-scale benchmark comparing MC-dropout, deep ensembles, and few-shot learning (FSL) for lung carcinoma classification, using entropy as the uncertainty measure across in-domain data, in-distribution shifts, and out-of-distribution data. The study finds that in-domain performance is high but that shifts, especially changes in organ origin and imaging modality, degrade accuracy. Ensembles and FSL nevertheless generally provide better calibration and uncertainty estimates than MC-dropout or the baseline. The results suggest that uncertainty-aware deep learning can support safer clinical decision-making in digital pathology, though practical deployment will require addressing computational cost and further calibration, especially for novel abnormal cases.
Abstract
Deep learning has shown tremendous progress in a wide range of digital pathology and medical image classification tasks. Its integration into safe clinical decision-making support requires robust and reliable models. However, real-world data exhibits variation that often lies outside the intended source distribution. Moreover, when test samples differ dramatically from the training data, clinical decision-making is greatly affected. Quantifying predictive uncertainty in models is crucial for well-calibrated predictions and for deciding when to trust a model and when not to. Unfortunately, many works have overlooked the importance of predictive uncertainty estimation. This paper evaluates whether predictive uncertainty estimation adds robustness to deep learning-based diagnostic decision-making systems. We investigate the effect of various carcinoma distribution shift scenarios on predictive performance and calibration. First, we systematically investigate three popular methods for improving predictive uncertainty, namely Monte Carlo dropout, deep ensembles, and few-shot learning, on the classification of lung adenocarcinoma as the primary disease in whole slide images. Second, we compare the effectiveness of these methods in terms of performance and calibration under clinically relevant distribution shifts: in-distribution shifts, comprising primary disease sub-types and other characterization analysis data, and out-of-distribution shifts, comprising well-differentiated cases, different organ origins, and imaging modality shifts. While studies on uncertainty estimation exist, to the best of our knowledge, no rigorous large-scale benchmark compares predictive uncertainty estimation methods under these dataset shifts for lung carcinoma classification.
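To make the entropy-based uncertainty measure concrete, the following is a minimal sketch of how predictive entropy is commonly computed from several stochastic predictions for one input, such as repeated MC-dropout passes or the members of a deep ensemble. The function name `predictive_entropy`, the two-class setup, and the probability values are illustrative assumptions, not taken from the paper, and the paper's exact formulation may differ.

```python
import math


def predictive_entropy(member_probs):
    """Entropy (in nats) of the mean predictive distribution for ONE input.

    member_probs: list of class-probability lists, one per stochastic
    prediction (e.g. T MC-dropout passes or M ensemble members).
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    # Average the class probabilities across passes / members.
    mean_probs = [
        sum(p[c] for p in member_probs) / n_members for c in range(n_classes)
    ]
    # H[p] = -sum_c p_c * log(p_c), treating 0 * log(0) as 0.
    return -sum(p * math.log(p) for p in mean_probs if p > 0.0)


# Hypothetical softmax outputs for a two-class problem (assumed values).
agree = [[0.99, 0.01], [0.98, 0.02], [0.99, 0.01]]   # members agree
disagree = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]      # members disagree

h_agree = predictive_entropy(agree)
h_disagree = predictive_entropy(disagree)
```

In this sketch, the input on which the members agree yields a low entropy, while the input on which they disagree yields an entropy close to the two-class maximum of log 2. High entropy of this kind is what would flag a shifted or out-of-distribution sample for review.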
