An Attention Infused Deep Learning System with Grad-CAM Visualization for Early Screening of Glaucoma
Ramanathan Swaminathan
TL;DR
The paper addresses early glaucoma screening from fundus images by fusing local CNN features (EfficientNet-B0) with global ViT representations through a cross-attention mechanism. On a combined Drishti and ACRIMA dataset, the authors report 94.8% accuracy, and their ablation studies show cross-attention outperforming both simple concatenation and self-attention. Grad-CAM visualizations yield interpretable heatmaps centered on the optic disc and cup, which supports clinical trust and potential deployment in resource-limited settings. The work suggests practical applicability in cloud-enabled, semi-automated screening to aid ophthalmologists worldwide, especially in developing regions.
Abstract
This work examines the benefits of combining a deep custom convolutional neural network (CNN) with a Vision Transformer (ViT), fused through a Cross-Attention module. Two public datasets used to train artificial intelligence models for glaucoma detection, ACRIMA and Drishti, are utilized. The Cross-Attention mechanism helps the model learn clinically relevant regions of the fundus through bidirectional feature exchange between the CNN and ViT streams. Experiments show improved performance compared with standalone baseline CNN and ViT models.
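To make the bidirectional feature exchange concrete, the following is a minimal NumPy sketch of single-head cross-attention applied in both directions between a CNN token stream and a ViT token stream. It is an illustration under stated assumptions, not the authors' implementation: the token counts (49 and 196), the shared feature width of 32, the random projection weights, the residual connection, and the mean-pool-and-concatenate fusion step are all hypothetical choices made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` attend over `context`.

    Returns the residually updated query tokens and the attention weights.
    """
    q = queries @ w_q                           # (n_q, d)
    k = context @ w_k                           # (n_c, d)
    v = context @ w_v                           # (n_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # scaled dot product, (n_q, n_c)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return queries + weights @ v, weights       # residual connection (assumed)


rng = np.random.default_rng(0)
d = 32                                          # assumed shared feature width

# Stand-ins for real backbone outputs (hypothetical shapes).
cnn_tokens = rng.normal(size=(49, d))           # e.g. a 7x7 CNN feature map, flattened
vit_tokens = rng.normal(size=(196, d))          # e.g. 14x14 ViT patch tokens


def init_weight():
    """Random projection matrix; a trained model would learn these."""
    return rng.normal(scale=d ** -0.5, size=(d, d))


cnn_params = [init_weight() for _ in range(3)]  # W_q, W_k, W_v for CNN -> ViT
vit_params = [init_weight() for _ in range(3)]  # W_q, W_k, W_v for ViT -> CNN

# Bidirectional exchange: each stream queries the other.
cnn_fused, cnn_attn = cross_attend(cnn_tokens, vit_tokens, *cnn_params)
vit_fused, vit_attn = cross_attend(vit_tokens, cnn_tokens, *vit_params)

# One possible fusion for a classifier head: pool each stream and concatenate.
fused = np.concatenate([cnn_fused.mean(axis=0), vit_fused.mean(axis=0)])
```

In this sketch each stream keeps its own token count while absorbing information from the other, which is the property that distinguishes cross-attention from plain concatenation of pooled features.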
