Selective Diabetic Retinopathy Screening with Accuracy-Weighted Deep Ensembles and Entropy-Guided Abstention
Jophy Lin
TL;DR
This work addresses the need for reliable, scalable diabetic retinopathy (DR) screening with an uncertainty-aware diagnostic framework. It combines seven CNN architectures through accuracy-weighted majority voting and couples the ensemble with an entropy-based abstention mechanism that quantifies predictive uncertainty and acts on it. On EyePACS, the unfiltered ensemble achieves $0.9370$ accuracy and $0.9376$ F1. Applying a probability-weighted entropy threshold (e.g., $H=0.38$) raises these to $0.9944$ accuracy and $0.9932$ F1 at the cost of discarding about $69.2\%$ of samples, which illustrates a tunable accuracy-coverage trade-off. The approach is intended to improve reliability and interpretability for clinical deployment, and it offers a generalizable paradigm for uncertainty-aware AI in high-stakes medical imaging, with potential extension to multi-class DR grading and other diseases.
Abstract
Diabetic retinopathy (DR), a microvascular complication of diabetes and a leading cause of preventable blindness, is projected to affect more than 130 million individuals worldwide by 2030. Early identification is essential to reduce irreversible vision loss, yet current diagnostic workflows rely on fundus photography and expert review, which remain costly and resource-intensive. Combined with DR's asymptomatic nature, this contributes to an underdiagnosis rate of approximately 25 percent. Although convolutional neural networks (CNNs) have demonstrated strong performance in medical imaging tasks, limited interpretability and the absence of uncertainty quantification restrict their clinical reliability. This study therefore introduces a deep ensemble learning framework integrated with uncertainty estimation to improve robustness, transparency, and scalability in DR detection. The ensemble incorporates seven CNN architectures (ResNet-50, DenseNet-121, MobileNetV3 Small and Large, and EfficientNet B0, B2, and B3), whose outputs are fused through an accuracy-weighted majority voting strategy. A probability-weighted entropy metric quantifies prediction uncertainty, enabling low-confidence samples to be excluded or flagged for additional review. Training and validation on 35,000 EyePACS retinal fundus images produced an unfiltered accuracy of 93.70 percent (F1 = 0.9376). Filtering out high-uncertainty samples then raised accuracy to a maximum of 99.44 percent (F1 = 0.9932). The framework shows that uncertainty-aware, accuracy-weighted ensembling improves reliability without hindering performance. With confidence-calibrated outputs and a tunable accuracy-coverage trade-off, it offers a generalizable paradigm for deploying trustworthy AI diagnostics in high-risk care.
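To make the two mechanisms named above concrete, the following is a minimal sketch of accuracy-weighted voting followed by entropy-based abstention. It is an illustration under stated assumptions, not the implementation used in this study: the abstract does not define the exact form of the probability-weighted entropy, so the sketch assumes it is the Shannon entropy (natural logarithm) of the accuracy-weighted mean of the per-model class probabilities. The function names, the three-model toy input, and the accuracy values are hypothetical.

```python
import numpy as np


def weighted_vote(probs, accuracies):
    """Fuse per-model outputs with weights proportional to model accuracy.

    probs:      array of shape (n_models, n_samples, n_classes)
    accuracies: array of shape (n_models,), e.g. validation accuracies
    Returns (labels, fused), where labels come from accuracy-weighted hard
    votes and fused is the accuracy-weighted mean probability per sample.
    """
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(accuracies, dtype=float)
    weights = weights / weights.sum()          # normalise weights to sum to 1

    n_classes = probs.shape[2]
    votes = probs.argmax(axis=2)               # each model's predicted class
    one_hot = np.eye(n_classes)[votes]         # (n_models, n_samples, n_classes)
    scores = np.tensordot(weights, one_hot, axes=1)  # weighted vote totals
    labels = scores.argmax(axis=1)

    fused = np.tensordot(weights, probs, axes=1)     # weighted mean probability
    return labels, fused


def entropy(p, eps=1e-12):
    """Shannon entropy in nats for each row of p (assumed entropy form)."""
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def predict_with_abstention(probs, accuracies, threshold):
    """Return labels, entropies, and a mask of samples that are retained."""
    labels, fused = weighted_vote(probs, accuracies)
    h = entropy(fused)
    keep = h <= threshold                      # False means abstain or flag
    return labels, h, keep


# Hypothetical toy input: 3 models, 2 samples, 2 classes (no DR, DR).
toy_probs = [
    [[0.02, 0.98], [0.55, 0.45]],
    [[0.05, 0.95], [0.40, 0.60]],
    [[0.03, 0.97], [0.52, 0.48]],
]
toy_accuracies = [0.90, 0.92, 0.91]            # illustrative values only

labels, h, keep = predict_with_abstention(toy_probs, toy_accuracies, threshold=0.38)
coverage = keep.mean()                         # fraction of samples retained
```

In this toy input the models agree confidently on the first sample, so its entropy falls below the threshold and the prediction is retained. They disagree on the second sample, so its entropy is high and it is flagged for review. Raising or lowering `threshold` moves along the accuracy-coverage trade-off described above; whether the reported $H=0.38$ is measured in nats or bits is not stated here, so the unit used in the sketch is an assumption.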
