Rethinking Glaucoma Calibration: Voting-Based Binocular and Metadata Integration

Taejin Jeong, Joohyeok Kim, Jaehoon Joo, Seong Jae Hwang

TL;DR

Glaucoma diagnosis suffers from diagnostic subjectivity and model overconfidence. The paper presents V-ViT, a Voting-based ViT that fuses binocular fundus images and patient metadata via cross-attention and multi-task learning, and employs a Monte Carlo dropout-based Voting System to calibrate predictions. A variance-matching regularization term aligns the model's uncertainty with the inter-observer variance $\sigma^2_{vote}$, and averaged expert labels serve as smoothed training targets. On the GreenCross and PAPILA datasets, V-ViT achieves state-of-the-art calibration and discrimination, and its attention maps focus on clinically meaningful regions, supporting safe and interpretable clinical deployment.
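The voting idea summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy logistic head, the function names `mc_dropout_vote` and `variance_matching_penalty`, the dropout rate, the number of votes, and the squared-difference form of the penalty are all assumptions made for illustration. The summary only states that dropout-based votes are aggregated and that the model's variance is aligned with $\sigma^2_{vote}$.

```python
import numpy as np


def mc_dropout_vote(features, weights, bias, n_votes=50, drop_prob=0.2, seed=0):
    """Collect n_votes stochastic predictions from a toy logistic head.

    Dropout stays active at inference time, so each pass is one "vote".
    Returns the mean probability and the variance across votes.
    """
    rng = np.random.default_rng(seed)
    probs = np.empty(n_votes)
    for t in range(n_votes):
        # Inverted dropout: zero out random features, rescale the survivors.
        mask = rng.random(features.shape) >= drop_prob
        dropped = features * mask / (1.0 - drop_prob)
        logit = dropped @ weights + bias
        probs[t] = 1.0 / (1.0 + np.exp(-logit))
    return probs.mean(), probs.var()


def variance_matching_penalty(model_var, expert_labels):
    """Hypothetical penalty: squared gap between model and expert variance.

    expert_labels holds the binary labels given by several graders for one
    eye. Their mean acts as a smoothed target and their variance plays the
    role of sigma^2_vote.
    """
    expert_labels = np.asarray(expert_labels, dtype=float)
    soft_target = expert_labels.mean()
    sigma2_vote = expert_labels.var()
    penalty = (model_var - sigma2_vote) ** 2
    return penalty, soft_target, sigma2_vote


# Toy usage with made-up numbers (not data from the paper).
features = np.array([0.8, -1.2, 0.5, 2.0])
weights = np.array([0.6, 0.3, -0.4, 0.9])
mean_prob, vote_var = mc_dropout_vote(features, weights, bias=-0.2)
penalty, soft_target, sigma2_vote = variance_matching_penalty(vote_var, [1, 1, 0, 1, 0])
```

In this toy setup, a case on which graders disagree yields a soft target away from 0 or 1 and a nonzero $\sigma^2_{vote}$, so a model that produces near-identical votes on that case is penalized for being more certain than the experts.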

Abstract

Glaucoma is a major cause of irreversible blindness, and its diagnosis is highly subjective. This inherent uncertainty, combined with the overconfidence of models optimized solely for accuracy, can lead to fatal issues such as overdiagnosis or missed critical disease. To ensure clinical trust, model calibration is essential for reliable predictions, yet research in this field remains limited. Existing calibration studies have overlooked glaucoma's systemic associations and high diagnostic subjectivity. To overcome these limitations, we propose V-ViT (Voting-based ViT), a framework that enhances calibration by integrating a patient's binocular information and metadata. Furthermore, to mitigate diagnostic subjectivity, V-ViT utilizes an iterative dropout-based Voting System to maximize calibration performance. The proposed framework achieved state-of-the-art performance across all metrics, including the primary calibration metrics. Our results demonstrate that V-ViT effectively resolves overconfident predictions in glaucoma diagnosis, providing highly reliable predictions for clinical use. Our source code is available at https://github.com/starforTJ/V-ViT.
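To make the notion of calibration concrete, the sketch below computes the expected calibration error (ECE), one commonly used calibration metric for binary classifiers. The abstract does not name the metrics it reports, so the choice of ECE, the 10 equal-width bins, and the function name `expected_calibration_error` are assumptions for illustration only.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE for a binary classifier, using equal-width confidence bins.

    probs  : predicted probability of the positive class (e.g. glaucoma)
    labels : ground-truth labels in {0, 1}
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    preds = (probs >= 0.5).astype(int)
    # Confidence is the probability assigned to the predicted class.
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            # Weight each bin's gap by the fraction of samples it holds.
            ece += in_bin.mean() * gap
    return ece


# Overconfident toy model: 99% confident everywhere, right half the time.
overconfident = expected_calibration_error([0.99, 0.99, 0.99, 0.99], [1, 1, 0, 0])
# Confident and always right: no gap between confidence and accuracy.
well_calibrated = expected_calibration_error([1.0, 0.0], [1, 0])
```

The first toy case shows the overconfidence the abstract describes: the model's stated confidence far exceeds its actual accuracy, which produces a large ECE even though each individual prediction looks decisive.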

Paper Structure

This paper contains 13 sections, 2 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of V-ViT. Through a total of $L$ Cross-Attention blocks, V-ViT learns to incorporate the fellow image and metadata for glaucoma diagnosis.
  • Figure 2: (a) The calibration curves comparing the model proposed by Vijayan et al. with V-ViT. (b) The ROC curves comparing the model proposed by Vijayan et al. with V-ViT.
  • Figure 3: Attention map for the target image. The yellow boxes and green regions on the target image indicate the areas that should be focused on for glaucoma diagnosis.