OCT-SelfNet: A Self-Supervised Framework with Multi-Modal Datasets for Generalized and Robust Retinal Disease Detection

Fatema-E Jannat, Sina Gholami, Minhaj Nur Alam, Hamed Tabkhi

TL;DR

OCT-SelfNet tackles limited generalization in retinal disease detection across multi-institution OCT data by employing a two-phase framework: self-supervised pretraining with masked autoencoders on SwinV2 backbones using unlabeled data, followed by supervised fine-tuning for AMD/Normal classification. The approach compares three transformer-based MAEs (ViT, Swin, SwinV2) and demonstrates that SwinV2-based MAE yields the best representations, enabling effective transfer to downstream tasks and cross-dataset evaluations. Across DS1–DS3, OCT-SelfNet-SwinV2 consistently outperforms a ResNet-50 baseline in AUC-ROC and AUC-PR, including under reduced data scenarios and unseen datasets, highlighting strong generalization and robustness. The study demonstrates practical implications for clinical deployment by reducing labeling requirements and improving reliability in diverse real-world OCT data settings.

Abstract

Despite the revolutionary impact of AI and the development of locally trained algorithms, achieving widespread generalized learning from multi-modal data in medical AI remains a significant challenge. This gap hinders the practical deployment of scalable medical AI solutions. To address this challenge, our research contributes a robust self-supervised machine learning framework, OCT-SelfNet, for detecting eye diseases from optical coherence tomography (OCT) images. In this work, datasets from multiple institutions are combined, enabling a more comprehensive range of representation. Our method uses a two-phase training approach that combines self-supervised pretraining and supervised fine-tuning with a masked autoencoder based on the SwinV2 backbone, providing a solution for real-world clinical deployment. Extensive experiments on three datasets, covering different encoder backbones, low-data settings, unseen-data settings, and the effect of augmentation, show that our method outperforms the baseline model, ResNet-50: it consistently attains AUC-ROC above 77% across all tests, whereas the baseline exceeds only 54%. In terms of AUC-PR, our proposed method exceeded 42%, a substantial increase of at least 10% over the baseline, which exceeded only 33%. This contributes to our understanding of our approach's potential and emphasizes its usefulness in clinical settings.
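The two metrics quoted in the abstract, AUC-ROC and AUC-PR, can both be computed from binary labels and predicted scores. The sketch below is illustrative only and is not the authors' evaluation code: the function names are hypothetical, AUC-ROC is computed through its pairwise-ranking interpretation, and AUC-PR is approximated by average precision, which is one common estimator (this excerpt does not state which estimator the paper uses).

```python
def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision measured at the rank of each positive.

    Assumes no tied scores; tied scores would need to be grouped per threshold.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_positives = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    if true_positives == 0:
        raise ValueError("need at least one positive label")
    return precision_sum / true_positives


if __name__ == "__main__":
    # Toy example: 1 = AMD, 0 = Normal (made-up labels and scores, not paper data).
    y_true = [1, 0, 1, 0]
    y_score = [0.9, 0.8, 0.7, 0.1]
    print(auc_roc(y_true, y_score))
    print(average_precision(y_true, y_score))
```

Because AUC-PR depends on the share of positive samples while AUC-ROC does not, reporting both gives a fuller picture when the AMD and Normal classes are imbalanced.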

Paper Structure (30 sections, 2 equations, 7 figures, 6 tables, 1 algorithm)

Figures (7)

  • Figure 1: Overview of the framework: In the initial pre-training phase (upper section), the framework uses a masked image autoencoder as a self-supervised task to learn representations from unlabeled images. In this process, a random subset of image patches is masked and the image is fed into the autoencoder. In the subsequent fine-tuning stage (lower section), the pre-trained encoder from the first phase is combined with a linear classifier for the classification task. The learned weights from the pre-training phase are transferred to the fine-tuning phase.
  • Figure 2: Illustration of Normal and AMD sample OCTs from three datasets (DS1, DS2, and DS3) along with bar chart and donut chart depicting their distribution.
  • Figure 3: Evaluation of AUC-ROC and AUC-PR for Test Set-1, Test Set-2, and Test Set-3 after fine-tuning on Dataset-1 with OCT-SelfNet-SwinV2 and assessing performance on other test sets.
  • Figure 4: Qualitative visualizations of the performance of SwinV2-Based Self-Supervised MAE on Three Datasets. From left to right: Input image with randomly masked regions, reconstructed images with predicted patches, and ground truth image.
  • Figure 5: Evaluation of AUC-ROC and AUC-PR for Test Set-1, Test Set-2, and Test Set-3 after fine-tuning on Dataset-2 with OCT-SelfNet-SwinV2 and assessing performance on other test sets.
  • ...and 2 more figures
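
The masking step described in the Figure 1 caption (a random subset of image patches is masked before the image reaches the autoencoder) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `patchify` and `random_patch_mask` are hypothetical helper names, the image is a plain nested list, and the 0.75 mask ratio is an assumed value that this excerpt does not specify. How the encoder treats masked patches (dropping them or replacing them with a learned mask token) is also not covered here.

```python
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order (left to right, then top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Pick a random subset of patch indices to mask.

    Returns (visible_indices, masked_indices), each sorted in ascending order.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


if __name__ == "__main__":
    # Toy 8 x 8 "image" whose pixel value encodes its position.
    image = [[r * 8 + c for c in range(8)] for r in range(8)]
    tiles = patchify(image, patch=2)  # 16 tiles of size 2 x 2
    visible, masked = random_patch_mask(len(tiles), mask_ratio=0.75, seed=0)
    print(len(tiles), len(visible), len(masked))
```

During pre-training, the reconstruction loss would typically be measured on the masked tiles only, so the encoder has to infer missing retinal structure from the visible context.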