ViT-2SPN: Vision Transformer-based Dual-Stream Self-Supervised Pretraining Networks for Retinal OCT Classification

Mohammadreza Saraei, Igor Kozak, Eung-Joo Lee

TL;DR

The paper tackles OCT disease classification under limited labeled data and privacy constraints by introducing ViT-2SPN, a three-stage framework that combines supervised pretraining, dual-stream self-supervised pretraining (SSP) on unlabeled OCTMNIST, and supervised fine-tuning with a ViT-Base backbone and momentum encoder. The dual-stream SSP stage learns robust OCT representations via online/target encoders and a negative cosine similarity objective, and is followed by fine-tuning on a small labeled subset, evaluated with stratified 10-fold cross-validation. Empirically, ViT-2SPN achieves a mean AUC of 0.93 and an accuracy of 0.77, outperforming established SSP baselines (BYOL, MoCo, SimCLR, SwAV, SimSiam) across multiple metrics, which indicates strong data efficiency and generalization. The work highlights practical impact for privacy-preserving ophthalmic diagnostics while acknowledging computational costs and proposing future directions such as knowledge distillation and cross-modality extension to broaden clinical utility.
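To make the evaluation protocol concrete, the following is a minimal sketch of stratified 10-fold splitting, in which each fold keeps roughly the same class proportions as the full labeled subset. The function names `stratified_k_folds` and `train_val_splits`, the seed, and the round-robin assignment are illustrative assumptions, not the authors' code.

```python
import random
from collections import defaultdict


def stratified_k_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds with similar class ratios.

    Hypothetical helper: indices of each class are shuffled and dealt
    round-robin across the folds.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def train_val_splits(folds):
    """Yield (train_indices, val_indices), using each fold once for validation."""
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


# Toy labels standing in for the four OCTMNIST classes (50 samples each).
labels = [c for c in range(4) for _ in range(50)]
folds = stratified_k_folds(labels, k=10)
splits = list(train_val_splits(folds))
```

In this toy setting every fold receives five samples per class, so each validation fold mirrors the class balance of the whole set.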

Abstract

Optical Coherence Tomography (OCT) is a non-invasive imaging modality essential for diagnosing various eye diseases. Despite its clinical significance, developing OCT-based diagnostic tools faces challenges, such as limited public datasets, sparse annotations, and privacy concerns. Although deep learning has made progress in automating OCT analysis, these challenges remain unresolved. To address these limitations, we introduce the Vision Transformer-based Dual-Stream Self-Supervised Pretraining Network (ViT-2SPN), a novel framework designed to enhance feature extraction and improve diagnostic accuracy. ViT-2SPN employs a three-stage workflow: Supervised Pretraining, Self-Supervised Pretraining (SSP), and Supervised Fine-Tuning. The pretraining phase leverages the OCTMNIST dataset (97,477 unlabeled images across four disease classes) with data augmentation to create dual-augmented views. A Vision Transformer (ViT-Base) backbone extracts features, while a negative cosine similarity loss aligns feature representations. Pretraining is conducted over 50 epochs with a learning rate of 0.0001 and momentum of 0.999. Fine-tuning is performed on a stratified 5.129% subset of OCTMNIST using 10-fold cross-validation. ViT-2SPN achieves a mean AUC of 0.93, accuracy of 0.77, precision of 0.81, recall of 0.75, and an F1 score of 0.76, outperforming existing SSP-based methods.
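The two ingredients named in the abstract, a negative cosine similarity loss and a momentum of 0.999, can be sketched as follows. This is a simplified illustration on plain Python lists: the symmetric averaging over the two views and the exponential-moving-average form of the momentum update follow common BYOL/SimSiam-style practice and are assumptions here, not details confirmed by the abstract. All function names are hypothetical.

```python
import math


def negative_cosine_similarity(p, z, eps=1e-12):
    """Return -cos(p, z) for two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    # eps guards against division by zero for all-zero vectors.
    return -dot / max(norm_p * norm_z, eps)


def symmetric_loss(p1, z2, p2, z1):
    """Average the loss over both view pairings (assumed, BYOL/SimSiam style)."""
    return 0.5 * negative_cosine_similarity(p1, z2) + \
           0.5 * negative_cosine_similarity(p2, z1)


def momentum_update(target, online, m=0.999):
    """Assumed EMA rule: target <- m * target + (1 - m) * online."""
    return [m * t + (1.0 - m) * o for t, o in zip(target, online)]


# Perfectly aligned features reach the minimum loss of -1.
aligned = negative_cosine_similarity([1.0, 0.0], [2.0, 0.0])
# Orthogonal features give a loss of 0.
orthogonal = negative_cosine_similarity([1.0, 0.0], [0.0, 1.0])
# With m = 0.999 the target weights move only slightly toward the online weights.
new_target = momentum_update([1.0, 1.0], [0.0, 2.0])
```

With a momentum this close to 1, the target encoder changes slowly, which is the usual motivation for pairing a momentum encoder with a similarity-based objective.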

Paper Structure

This paper contains 11 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: (A) Self-supervised learning (SSL) pipeline for classification, (B) Self-supervised pretraining (SSP) approach for classification on the OCTMNIST dataset.
  • Figure 2: Sample images of the OCTMNIST dataset for retinal disease classification.
  • Figure 3: Architecture of vision transformer-based dual-stream self-supervised pretraining networks (ViT-2SPN) for retinal OCT classification.
  • Figure 4: Performance improvements for ViT-2SPN on the balanced OCTMNIST dataset.
  • Figure 5: Comparison of AUC curves for various SSP models on the balanced OCTMNIST dataset.
  • ...and 1 more figure