OCT Data is All You Need: How Vision Transformers with and without Pre-training Benefit Imaging
Zihao Han, Philippe De Wilde
TL;DR
This study examines whether ImageNet-based pre-training benefits Vision Transformers (ViTs) for OCT image classification across different dataset sizes. It compares ViT models trained from scratch with models initialized from ImageNet21K weights on a four-class OCT task (CNV, DME, Drusen, Normal). Pre-training accelerates early convergence when data is scarce, but with sufficient OCT data, training from scratch reaches comparable or slightly better final accuracy. The results point to the substantial domain gap between OCT and natural images as a key factor limiting pre-training gains. The work offers practical guidance for transfer-learning choices in ophthalmic imaging and motivates OCT-specific pre-training or self-supervised strategies as directions for future work.
Abstract
Optical Coherence Tomography (OCT) provides high-resolution cross-sectional images that are useful for diagnosing various diseases. Because OCT images differ markedly from natural images, it is unclear whether large-scale pre-training on datasets such as ImageNet is always beneficial. In this paper, we investigate the impact of ImageNet-based pre-training on Vision Transformer (ViT) performance for OCT image classification across different dataset sizes. Our experiments cover a four-class retinal classification task (CNV, DME, Drusen, Normal). The results suggest that pre-training can accelerate convergence and may offer better performance on smaller datasets, whereas training from scratch can achieve comparable or even superior accuracy when sufficient OCT data is available. Our findings highlight the importance of matching domain characteristics in pre-training and call for further study of large-scale OCT-specific pre-training.
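
To make the comparison described above concrete, the following is a minimal sketch of how an experiment grid over dataset sizes and initialization schemes could be laid out. It is an illustration under stated assumptions, not the authors' pipeline: the helper names (stratified_subset, experiment_grid), the subset fractions, and the per-class stratified sampling are all hypothetical, and the abstract does not specify how the smaller datasets were drawn.

```python
import random
from collections import defaultdict
from itertools import product

# The four classes named in the abstract.
CLASSES = ("CNV", "DME", "DRUSEN", "NORMAL")


def stratified_subset(samples, fraction, seed=0):
    """Return a subset holding roughly `fraction` of each class.

    `samples` is a list of (path, label) pairs. Stratified sampling is an
    assumption made for this sketch; the source does not describe it.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))
    rng = random.Random(seed)  # fixed seed so every run sees the same subset
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        k = max(1, round(len(items) * fraction))  # keep at least one per class
        subset.extend(rng.sample(items, k))
    rng.shuffle(subset)
    return subset


def experiment_grid(fractions=(0.01, 0.1, 1.0), inits=("scratch", "imagenet21k")):
    """Enumerate (fraction, init) pairs; the fraction values are hypothetical."""
    return list(product(fractions, inits))


if __name__ == "__main__":
    # Toy file list standing in for a real OCT dataset.
    toy = [(f"{c}_{i}.jpeg", c) for c in CLASSES for i in range(100)]
    for fraction, init in experiment_grid():
        subset = stratified_subset(toy, fraction)
        print(f"init={init:12s} fraction={fraction:<5} n_train={len(subset)}")
```

Each (fraction, init) pair would then correspond to one training run of the same ViT architecture, differing only in the training subset and in whether the weights start from random initialization or from ImageNet21K. Using the same seed for both initializations keeps the two runs at a given fraction on identical data, so any gap reflects the initialization and not the sample.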
