When Do Domain-Specific Foundation Models Justify Their Cost? A Systematic Evaluation Across Retinal Imaging Tasks
David Isztl, Tahm Spitznagel, Gabor Mark Somfai, Rui Santos
TL;DR
This study asks whether large retina-specific foundation models are necessary for retinal disease classification. Through a controlled benchmark of 12–13 backbones across OCT and CFP tasks, it shows that compact pretrained architectures (27–29M parameters) nearly always match or outperform larger domain-specific models, and that pretraining benefits are larger for CFP tasks and for the harder ordinal DR grading task. Domain-specific retinal pretraining proves valuable primarily for the most challenging DR task, while ImageNet pretraining suffices for the other tasks, which argues for allocating resources task by task. The results challenge the 'bigger is better' paradigm in medical vision foundation models and offer practical guidance on architecture choice, pretraining strategy, and modality-aware deployment. Overall, compact hierarchical pretrained models deliver near-optimal performance for most retinal imaging applications, so large retina-specific pretraining can be reserved for edge cases that require fine-grained discrimination under severe class imbalance.
Abstract
Large vision foundation models have been widely adopted for retinal disease classification without systematic evidence justifying their parameter requirements. In the present work, we address two critical questions. First, are large domain-specific foundation models essential, or do compact general-purpose architectures suffice? Second, does specialized retinal pretraining justify its computational cost? To answer these questions, we benchmark initialization strategies across four retinal imaging classification tasks spanning the Optical Coherence Tomography (OCT) and Color Fundus Photography (CFP) modalities: 8-class OCT classification, 3-class diabetic macular edema (DME), 5-class diabetic retinopathy (DR), and 3-class glaucoma (GL) detection. We evaluate 12-13 model configurations per task, including Vision Transformers (22.8M-86.6M parameters), Swin Transformers (27.6M-28.3M), ConvNeXt (28.6M), and the domain-specific RETFound models (303M), under identical training conditions. Our results challenge prevailing assumptions. First, we demonstrate that pretraining provides universal benefits (5.18-18.41% improvement) that scale with task difficulty. Second, compact architectures (27-29M) dominate the Pareto frontiers; SwinV2-tiny achieves top-1 performance on three datasets. Third, RETFound (303M) justifies its computational cost only for challenging DR grading (accuracy of 71.15%), while ImageNet pretraining proves sufficient for all other tasks (DME accuracy: 99.24%, OCT accuracy: 97.96%). CFP tasks show larger pretraining accuracy gains (9.13-18.41%) than OCT (5.18%). Thus, the evidence suggests that compact general-purpose models deliver near-optimal performance for most retinal classification tasks; specialized foundation models are warranted only for fine-grained discrimination under extreme class imbalance.
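
To make the Pareto-frontier claim concrete, the following is a minimal sketch of how an accuracy-versus-parameter-count frontier could be computed for a set of candidate backbones. It is not the authors' evaluation code. The `Candidate` type, the `pareto_frontier` function, the model names, and all accuracy values below are hypothetical placeholders chosen for illustration; only the parameter counts echo the sizes quoted in the abstract.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """One benchmarked configuration (hypothetical container for this sketch)."""
    name: str
    params_millions: float
    accuracy: float


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates.

    A candidate is dominated if another one has no more parameters and
    no lower accuracy, and is strictly better on at least one of the two.
    """
    # Sort by size ascending; for equal sizes, put the most accurate first.
    ordered = sorted(candidates, key=lambda c: (c.params_millions, -c.accuracy))
    frontier: List[Candidate] = []
    best_accuracy = float("-inf")
    for candidate in ordered:
        # Keep a model only if it beats every smaller-or-equal model seen so far.
        if candidate.accuracy > best_accuracy:
            frontier.append(candidate)
            best_accuracy = candidate.accuracy
    return frontier


# Placeholder accuracies, NOT results from the study.
demo = [
    Candidate("model_a", 22.8, 0.90),
    Candidate("model_b", 28.3, 0.95),
    Candidate("model_c", 86.6, 0.94),
    Candidate("model_d", 303.0, 0.96),
]

frontier_names = [c.name for c in pareto_frontier(demo)]
print(frontier_names)
```

In this toy input, `model_c` is excluded because the smaller `model_b` is more accurate, whereas `model_d` stays on the frontier because nothing smaller matches its accuracy. Whether the extra parameters of such a model are worth their cost is the task-dependent question the abstract raises.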
