
An Attention Infused Deep Learning System with Grad-CAM Visualization for Early Screening of Glaucoma

Ramanathan Swaminathan

TL;DR

The paper addresses early glaucoma screening from fundus images by fusing local CNN features (EfficientNet-B0) with global ViT representations through a cross-attention mechanism. On a combined Drishti and ACRIMA dataset, the authors report 94.8% accuracy, and their ablation studies favor cross-attention over simple concatenation or self-attention. Grad-CAM visualizations produce heatmaps centered on the optic disc and cup, which makes the predictions easier for clinicians to interpret and trust. The work points to cloud-enabled, semi-automated screening as a practical way to assist ophthalmologists, especially in developing regions and other resource-limited settings.
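To make the Grad-CAM step concrete, the following is a minimal NumPy sketch of how a heatmap is formed from one convolutional layer. It is not the authors' code: the function name `grad_cam`, the toy 2-channel 4x4 feature maps, and the gradient values are all hypothetical, and in practice the activations and gradients would come from a backward pass through the trained network.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Combine feature maps into a Grad-CAM heatmap.

    activations: array of shape (K, H, W), feature maps of one conv layer.
    gradients:   array of shape (K, H, W), gradient of the class score
                 with respect to those feature maps.
    """
    # Channel importance: global average of the gradients over space.
    weights = gradients.mean(axis=(1, 2))              # shape (K,)
    # Weighted sum of the feature maps across channels.
    cam = np.tensordot(weights, activations, axes=1)   # shape (H, W)
    # Keep only regions with a positive influence on the class score.
    cam = np.maximum(cam, 0.0)
    # Scale to [0, 1] for display, guarding against an all-zero map.
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy input (hypothetical values): channel 0 supports the class,
# channel 1 argues against it.
activations = np.zeros((2, 4, 4))
activations[0, 1, 2] = 1.0
activations[1, 3, 0] = 1.0
gradients = np.stack([np.full((4, 4), 0.5), np.full((4, 4), -0.5)])

heatmap = grad_cam(activations, gradients)
```

In a screening setting the resulting map would be upsampled to the fundus image size and overlaid on it, so that a reader can check whether the highlighted area coincides with the optic disc and cup.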

Abstract

This work combines a deep custom convolutional neural network (CNN) with a Vision Transformer (ViT), fusing the two through a Cross-Attention module. Two fundus image datasets for glaucoma detection, ACRIMA and Drishti, are used. The Cross-Attention mechanism helps the model learn clinically relevant regions of the fundus through bidirectional feature exchange between the CNN and ViT streams. Experiments show improved performance over standalone baseline CNN and ViT models.
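The bidirectional exchange described in the abstract can be illustrated with a small NumPy sketch of scaled dot-product cross-attention. Everything here is an assumption made for illustration: the token counts (49 CNN tokens, 197 ViT tokens), the shared width of 16, the random projection matrices, and the mean-pool-and-concatenate fusion at the end are not taken from the paper, which may use different dimensions, multiple heads, and a different fusion head.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` attend over `context`."""
    q = queries @ w_q                      # (Nq, d)
    k = context @ w_k                      # (Nc, d)
    v = context @ w_v                      # (Nc, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 16                                     # hypothetical shared token width
cnn_tokens = rng.normal(size=(49, d))      # e.g. a flattened 7x7 feature map
vit_tokens = rng.normal(size=(197, d))     # e.g. class token plus 196 patches

def random_projections():
    # Stand-ins for learned weight matrices (random here, trained in practice).
    return [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]

# Direction 1: CNN tokens query the ViT stream.
cnn_from_vit, attn_cnn = cross_attention(cnn_tokens, vit_tokens, *random_projections())
# Direction 2: ViT tokens query the CNN stream.
vit_from_cnn, attn_vit = cross_attention(vit_tokens, cnn_tokens, *random_projections())

# One simple way to fuse the streams: mean-pool each and concatenate.
fused = np.concatenate([cnn_from_vit.mean(axis=0), vit_from_cnn.mean(axis=0)])
```

Each direction uses its own projections, so the CNN stream can pull in global context from the ViT tokens while the ViT stream pulls in local detail from the CNN tokens. A classifier would then operate on `fused`.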

Paper Structure

This paper contains 12 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: The layers of the EfficientNet-B0 model [16].
  • Figure 2: Architecture of the Vision Transformer [17].
  • Figure 3: Accuracy on the training and testing datasets over epochs 0 to 50.
  • Figure 4: Loss on the training and testing datasets over epochs 0 to 50.
  • Figure 5: The optic disc and cup region is highlighted, showing significant anomalies.
  • ...and 3 more figures