Enhancing DR Classification with Swin Transformer and Shifted Window Attention

Meher Boulaabi, Takwa Ben Aïcha Gader, Afef Kacem Echi, Zied Bouraoui

TL;DR

Diabetic retinopathy (DR) classification is hindered by variable image quality, class imbalance, and subtle retinal features. The authors present a Swin Transformer–based framework with a preprocessing pipeline (cropping, CLAHE, and data augmentation) to improve feature extraction and generalization, using shifted window attention for efficient high-resolution processing. On the Aptos and IDRiD datasets, the method achieves 89.65% and 97.40% accuracy, respectively, across five DR grades, with preprocessing delivering notable performance gains. This approach offers a scalable, robust solution for automated retinal screening with potential clinical impact.

Abstract

Diabetic retinopathy (DR) is a leading cause of blindness worldwide, underscoring the importance of early detection for effective treatment. However, automated DR classification remains challenging due to variations in image quality, class imbalance, and pixel-level similarities that hinder model training. To address these issues, we propose a robust preprocessing pipeline incorporating image cropping, Contrast-Limited Adaptive Histogram Equalization (CLAHE), and targeted data augmentation to improve model generalization and resilience. Our approach leverages the Swin Transformer, which utilizes hierarchical token processing and shifted window attention to efficiently capture fine-grained features while maintaining linear computational complexity. We validate our method on the Aptos and IDRiD datasets for multi-class DR classification, achieving accuracy rates of 89.65% and 97.40%, respectively. These results demonstrate the effectiveness of our model, particularly in detecting early-stage DR, highlighting its potential for improving automated retinal screening in clinical settings.
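The cropping step of the preprocessing pipeline can be illustrated with a minimal sketch. The abstract does not say how cropping is performed, so the approach here (trimming the dark border around the retina by thresholding) is an assumption, as are the function name `crop_to_retina` and the `threshold` default. CLAHE is not reimplemented; libraries such as OpenCV provide it (for example `cv2.createCLAHE`).

```python
import numpy as np


def crop_to_retina(image: np.ndarray, threshold: int = 10) -> np.ndarray:
    """Crop a fundus image to the bounding box of its non-dark pixels.

    Hypothetical helper: the threshold-based rule is an assumption,
    not the procedure described by the authors.
    """
    # Collapse colour channels so the mask is 2-D (H, W).
    gray = image.max(axis=2) if image.ndim == 3 else image
    mask = gray > threshold
    if not mask.any():
        # Nothing is brighter than the threshold: return the image unchanged.
        return image
    rows = np.flatnonzero(mask.any(axis=1))  # row indices containing retina pixels
    cols = np.flatnonzero(mask.any(axis=0))  # column indices containing retina pixels
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

Cropping before contrast enhancement keeps the black background from influencing the local histograms that CLAHE computes, which is one plausible reason to order the steps this way.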

Paper Structure

This paper contains 11 sections, 1 figure, 3 tables.

Figures (1)

  • Figure 1: DR classification pipeline. The input images undergo preprocessing with CLAHE and data augmentation to enhance robustness. Images are divided into patches using PatchExtract and projected into a high-dimensional space via PatchEmbedding. These embeddings pass through transformer blocks, where Module 1 applies standard window-based self-attention and Module 2 applies shifted window attention to exchange information across neighboring windows. Residual connections and MLP layers with dropout stabilize training and improve feature extraction. Finally, PatchMerging reduces the spatial dimensions, followed by global average pooling and a dense layer with a softmax classifier.
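The difference between the two attention modules in Figure 1 comes down to how tokens are grouped into windows. The following sketch shows only that grouping, using NumPy; the function names `window_partition` and `shifted_window_partition` are hypothetical, and the attention computation itself, as well as the attention mask that Swin Transformer applies to tokens wrapped around by the cyclic shift, is omitted.

```python
import numpy as np


def window_partition(x: np.ndarray, window: int) -> np.ndarray:
    """Group a (H, W, C) feature map into (num_windows, window * window, C)."""
    h, w, c = x.shape
    if h % window or w % window:
        raise ValueError("H and W must be divisible by the window size")
    # Split each spatial axis into (number of windows, window size).
    x = x.reshape(h // window, window, w // window, window, c)
    # Bring the two window-index axes together, then flatten each window.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window * window, c)


def shifted_window_partition(x: np.ndarray, window: int) -> np.ndarray:
    """Cyclically shift the map by half a window before partitioning.

    Assumption: a shift of window // 2, as in the original Swin design.
    """
    shift = window // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window)
```

With an 8×8 map and a window size of 4, the regular partition yields four 16-token windows aligned to the grid, while the shifted partition yields windows that straddle the previous boundaries, so alternating the two lets information flow between neighboring windows. Because attention is computed only within fixed-size windows, its cost grows linearly with the number of windows rather than quadratically with the total number of tokens.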