Rethinking Glaucoma Calibration: Voting-Based Binocular and Metadata Integration
Taejin Jeong, Joohyeok Kim, Jaehoon Joo, Seong Jae Hwang
TL;DR
Glaucoma diagnosis suffers from diagnostic subjectivity and model overconfidence. The paper presents V-ViT, a Voting-based ViT that fuses binocular fundus images and patient metadata via cross-attention and multi-task learning, and employs a Monte Carlo dropout-based Voting System to calibrate predictions. A variance-matching regularization term aligns the model's uncertainty with inter-observer variance $\sigma^2_{vote}$, using averaged expert labels to smooth targets. On GreenCross and PAPILA datasets, V-ViT achieves state-of-the-art calibration and discrimination, with attention maps focusing on clinically meaningful regions, supporting safe and interpretable clinical deployment.
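The voting and variance-matching ideas summarized above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the toy logistic head, the function names `mc_dropout_votes` and `variance_matching_penalty`, the dropout rate, the number of votes, and the squared-difference form of the penalty are all assumptions made for illustration. Only the general recipe (repeated dropout-enabled forward passes, averaged expert labels as soft targets, and a term that pulls the model's vote variance toward $\sigma^2_{vote}$) comes from the text.

```python
import numpy as np

def mc_dropout_votes(features, weights, bias, n_votes=50, drop_rate=0.2, seed=0):
    """Collect n_votes stochastic predictions from a toy logistic head.

    Dropout stays active at inference, so every pass acts as one "voter".
    The logistic head is a stand-in (assumption), not the V-ViT model.
    """
    rng = np.random.default_rng(seed)
    votes = np.empty(n_votes)
    for i in range(n_votes):
        # Inverted dropout: zero random features and rescale the survivors.
        mask = rng.random(features.shape) >= drop_rate
        dropped = features * mask / (1.0 - drop_rate)
        logit = dropped @ weights + bias
        votes[i] = 1.0 / (1.0 + np.exp(-logit))
    return votes

def variance_matching_penalty(votes, expert_labels):
    """Squared gap between model vote variance and inter-observer variance.

    The squared-difference form is an assumption; the source only states
    that the regularizer aligns model uncertainty with sigma^2_vote.
    """
    sigma2_model = np.asarray(votes, dtype=float).var()
    sigma2_vote = np.asarray(expert_labels, dtype=float).var()
    return (sigma2_model - sigma2_vote) ** 2

# Hypothetical inputs: four fused features and four expert gradings of one eye.
features = np.array([0.8, -0.3, 1.2, 0.5])
weights = np.array([0.9, -0.4, 0.7, 0.2])
expert_labels = [1, 1, 0, 1]

votes = mc_dropout_votes(features, weights, bias=-0.5)
prediction = float(votes.mean())              # voted (averaged) probability
soft_target = float(np.mean(expert_labels))   # averaged expert labels
penalty = float(variance_matching_penalty(votes, expert_labels))
```

In this sketch, `prediction` plays the role of the calibrated output, `soft_target` is the smoothed label a training loss would compare it against, and `penalty` would be added to that loss with some weight.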
Abstract
Glaucoma is a major cause of irreversible blindness, and its diagnosis is highly subjective. This inherent uncertainty, combined with the overconfidence of models optimized solely for accuracy, can lead to critical errors such as overdiagnosis or missed disease. Model calibration is essential for reliable predictions and clinical trust, yet research in this area remains limited. Existing calibration studies have overlooked glaucoma's systemic associations and high diagnostic subjectivity. To overcome these limitations, we propose V-ViT (Voting-based ViT), a framework that enhances calibration by integrating a patient's binocular information and metadata. Furthermore, to mitigate diagnostic subjectivity, V-ViT uses an iterative dropout-based Voting System to maximize calibration performance. The proposed framework achieves state-of-the-art performance across all metrics, including the primary calibration metrics. Our results demonstrate that V-ViT effectively mitigates overconfident predictions in glaucoma diagnosis, providing highly reliable predictions for clinical use. Our source code is available at https://github.com/starforTJ/V-ViT.
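To make the notion of overconfidence concrete, the sketch below computes expected calibration error (ECE), a widely used calibration measure. The abstract does not name its primary calibration metrics, so the choice of ECE, the function name `expected_calibration_error`, the 10-bin setting, and the toy probabilities are all assumptions for illustration.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin-weighted mean of |accuracy - confidence|."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = (probs >= 0.5).astype(int)
    # Confidence is the probability assigned to the predicted class.
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)
    # Interior bin edges 0.1, ..., 0.9 give bin indices 0 .. n_bins - 1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(conf, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Overconfident toy model: 99% confident, but right only half the time.
ece_overconfident = expected_calibration_error([0.99] * 4, [1, 1, 0, 0])
# Calibrated toy model: 75% confident and right three times out of four.
ece_calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
```

A lower value means that stated confidence tracks observed accuracy more closely, which is the property the abstract describes as reliable prediction.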
