Calibrated Bayesian Deep Learning for Explainable Decision Support Systems Based on Medical Imaging
Hua Xu, Julián D. Arias-Londoño, Juan I. Godino-Llorente
TL;DR
This work addresses the need for calibrated uncertainty in medical-imaging AI by introducing a generalizable Bayesian framework that enforces alignment between predictive accuracy and uncertainty. Central to the approach are the Confidence-Uncertainty Boundary Curve (CUBC), the CUB-Loss regularizer for training-time calibration, and Dual Temperature Scaling (DTS) for post-hoc refinement, all built on a variational Bayesian neural network foundation. The method is validated on three tasks (pneumonia screening from chest X-rays, diabetic retinopathy detection, and skin lesion identification), demonstrating improved calibration as measured by AvU, robust uncertainty behavior under data scarcity and class imbalance, and effective near-out-of-distribution (near-OOD) detection. The results indicate that the framework enhances clinical trust by producing low uncertainty for correct predictions and high uncertainty for incorrect ones, while preserving or improving accuracy, thereby supporting safer deployment of AI-assisted decision support in healthcare.
Abstract
In critical decision support systems based on medical imaging, the reliability of AI-assisted decision-making is as important as predictive accuracy. Although deep learning models have demonstrated high accuracy, they frequently suffer from miscalibration, manifested as overconfidence in erroneous predictions. To facilitate clinical acceptance, models must quantify uncertainty in a manner that correlates with prediction correctness, allowing clinicians to identify unreliable outputs for further review. To address this need, this paper proposes a generalizable probabilistic optimization framework grounded in Bayesian deep learning. Specifically, a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) is introduced that penalizes high-certainty errors and low-certainty correct predictions, explicitly enforcing alignment between prediction correctness and uncertainty estimates. Complementing this training-time optimization, a Dual Temperature Scaling (DTS) strategy is devised for post-hoc calibration, further refining the posterior distribution to improve intuitive explainability. The proposed framework is validated on three distinct medical imaging tasks: automatic screening of pneumonia, diabetic retinopathy detection, and identification of skin lesions. Empirical results demonstrate that the proposed approach achieves consistent calibration improvements across diverse modalities, maintains robust performance in data-scarce scenarios, and remains effective on severely imbalanced datasets, underscoring its potential for real clinical deployment.
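To make the idea of aligning correctness with uncertainty concrete, the following is a minimal sketch of an accuracy-versus-uncertainty style score as it is commonly defined in the calibration literature: the fraction of predictions that are either correct and certain or incorrect and uncertain. It is an illustration under stated assumptions, not the authors' implementation. The function names (`predictive_entropy`, `avu_counts`, `avu`), the use of predictive entropy as the uncertainty measure, the fixed threshold of 0.5, and the toy probabilities are all hypothetical choices made for this example.

```python
import numpy as np


def predictive_entropy(probs):
    """Entropy of each row of a (n_samples, n_classes) probability array."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=1)


def avu_counts(correct, uncertainty, threshold):
    """Count the four correctness/certainty combinations.

    A prediction is treated as 'certain' when its uncertainty is at or
    below the threshold (an assumption made for this sketch).
    """
    correct = np.asarray(correct, dtype=bool)
    certain = np.asarray(uncertainty) <= threshold
    n_ac = int(np.sum(correct & certain))    # accurate and certain
    n_au = int(np.sum(correct & ~certain))   # accurate but uncertain
    n_ic = int(np.sum(~correct & certain))   # inaccurate but certain
    n_iu = int(np.sum(~correct & ~certain))  # inaccurate and uncertain
    return n_ac, n_au, n_ic, n_iu


def avu(correct, uncertainty, threshold):
    """Share of predictions whose certainty agrees with their correctness."""
    n_ac, n_au, n_ic, n_iu = avu_counts(correct, uncertainty, threshold)
    return (n_ac + n_iu) / (n_ac + n_au + n_ic + n_iu)


# Toy data (hypothetical): mean predictive probabilities for four images.
probs = np.array([[0.95, 0.05],
                  [0.55, 0.45],
                  [0.90, 0.10],
                  [0.40, 0.60]])
labels = np.array([0, 1, 1, 1])

correct = probs.argmax(axis=1) == labels
entropy = predictive_entropy(probs)
score = avu(correct, entropy, threshold=0.5)
```

In this toy case the third image is the clinically dangerous one: it is misclassified with low entropy, so it lowers the score. A training-time penalty on high-certainty errors and low-certainty correct predictions, as described above, aims to move such cases into the two agreeing categories.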
