When Do Domain-Specific Foundation Models Justify Their Cost? A Systematic Evaluation Across Retinal Imaging Tasks

David Isztl, Tahm Spitznagel, Gabor Mark Somfai, Rui Santos

TL;DR

This study interrogates whether large retina-specific foundation models are necessary for retinal disease classification. Through a controlled benchmark of 12–13 backbones across OCT and CFP tasks, it shows that compact pretrained architectures (27–29M parameters) nearly always match or outperform larger domain-specific models, with pretraining benefits larger for CFP tasks and for harder ordinal DR grading. Domain-specific retinal pretraining proves valuable primarily for the most challenging DR task, while ImageNet pretraining suffices for the other tasks, highlighting task-dependent resource allocation. The results challenge the 'bigger is better' paradigm in medical vision foundation models and offer practical guidance on architecture choice, pretraining strategy, and modality-aware deployment. Overall, compact hierarchical pretrained models deliver near-optimal performance for most retinal imaging applications, so large retina-specific pretraining can be reserved for boundary cases requiring fine-grained discrimination under severe class imbalance.

Abstract

Large vision foundation models have been widely adopted for retinal disease classification without systematic evidence justifying their parameter requirements. In the present work, we address two critical questions: First, are large domain-specific foundation models essential, or do compact general-purpose architectures suffice? Second, does specialized retinal pretraining justify its computational cost? To answer these questions, we benchmark initialization strategies across four retinal imaging classification tasks spanning Optical Coherence Tomography (OCT) and Color Fundus Photography (CFP) modalities: 8-class OCT classification, 3-class diabetic macular edema (DME), 5-class diabetic retinopathy (DR), and 3-class glaucoma (GL) detection. We evaluate 12-13 model configurations per task, including vision transformers (22.8M-86.6M parameters), Swin Transformers (27.6M-28.3M), ConvNeXt (28.6M), and the domain-specific RETFound models (303M), under identical training conditions. Our results challenge prevailing assumptions: First, we demonstrate that pretraining provides universal benefits (5.18-18.41% improvement), scaling with task difficulty. Second, compact architectures (27-29M) dominate Pareto frontiers; SwinV2-tiny achieves top-1 performance on three datasets. Third, RETFound (303M) justifies its computational cost only for challenging DR grading (accuracy of 71.15%), while ImageNet pretraining proves sufficient for all other tasks (DME accuracy: 99.24%, OCT accuracy: 97.96%). CFP tasks show larger pretraining accuracy gains (9.13-18.41%) than OCT (5.18%). Thus, the evidence suggests that compact general-purpose models deliver near-optimal performance for most retinal classification tasks; specialized foundation models are warranted only for fine-grained discrimination under extreme class imbalance.

Paper Structure

This paper contains 56 sections, 24 figures, 14 tables.

Figures (24)

  • Figure 1: OCT: Multi-panel box plot comparison of pretrained vs scratch-trained models across four key metrics. Each panel shows box-and-whisker plots where boxes represent the interquartile range (IQR, 25th--75th percentiles), horizontal lines show medians, whiskers extend to 1.5$\times$IQR, and individual points indicate outliers. (A) Best Validation Accuracy: Pretrained models (n=9) achieve mean 96.94% $\pm$ 0.88% vs scratch (n=4) 91.76% $\pm$ 3.37%, representing a 5.18 percentage point improvement (p < 0.05, Mann-Whitney U test) with 3.8$\times$ variance reduction. (B) AUROC (macro): Pretrained models achieve 99.87% $\pm$ 0.04% vs scratch 99.24% $\pm$ 0.39%, a 0.63 percentage point improvement (p < 0.05) demonstrating superior threshold-independent discrimination. (C) F1-Score (macro): Pretrained 96.94% $\pm$ 0.88% vs scratch 91.76% $\pm$ 3.38%, mirroring accuracy patterns. (D) Cohen's Kappa: Pretrained 96.50% $\pm$ 1.00% vs scratch 90.58% $\pm$ 3.85%, showing a 5.92 percentage point improvement in chance-adjusted agreement. Across all metrics, pretrained models show dramatically reduced variance and higher central tendency, indicating that ImageNet initialization provides universal, consistent benefits for the 8-class OCT classification task.
  • Figure 2: OCT: Multi-panel horizontal bar chart comparing all 13 model configurations across four key metrics. Each panel shows models on the Y-axis (sorted by performance, bottom: best) with metric values [0.0--1.0] on the X-axis. Color coding distinguishes pretrained (blue bars) from scratch-trained (green bars) models. (A) AUROC (macro): Threshold-independent discrimination; pretrained models cluster at 0.9979--0.9991, while scratch models range 0.9876--0.9968. All models achieve excellent AUROC >0.985, indicating strong class separability for the 8-class OCT. (B) F1-Score (macro): Harmonic mean of precision/recall; pretrained models 0.9565--0.9797, scratch 0.8802--0.9558. F1-Score shows the same pretrained-scratch separation pattern as accuracy. (C) Cohen's Kappa: Chance-adjusted agreement; pretrained 0.9502--0.9767, scratch 0.8633--0.9494. The 8--10 percentage point gap demonstrates pretraining's value beyond chance-level agreement. (D) Accuracy: Pretrained models dominate top positions (0.9565--0.9796), with ConvNeXtV2-tiny and SwinV2-tiny leading. All pretrained models exceed 0.95, while scratch models span 0.8804--0.9557. Consistent stratification across all panels confirms that ImageNet initialization provides universal benefits regardless of metric choice.
  • Figure 3: OCT: Pareto frontier analysis identifying objectively optimal models in the accuracy-parameter trade-off space. Models on the frontier are Pareto-optimal: no other model achieves higher accuracy without requiring more parameters. The x-axis shows model size, while the y-axis shows validation accuracy. Three models form the frontier: DinoV2-small (22.8M, 95.86%), SwinV2-tiny (27.6M, 97.93%), and ConvNeXtV2-tiny (28.6M, 97.96%). Critically, all frontier models cluster in the 23--29M parameter range; larger models including RETFound-MAE-OCT (303M, 97.29%) and DinoV2-small-reg (86.6M, 97.14%) fall below the frontier, demonstrating that increased model size does not yield proportional accuracy gains for this 8-class OCT classification task. This challenges the assumption that 300M+ parameter foundation models are necessary for optimal performance.
  • Figure 4: OCT: Multi-panel scatter plots showing model performance versus parameter count across three metrics. Pretrained models shown in blue, scratch-trained in green. Each point represents one model configuration. Pearson correlation statistics (r, p-value) quantify the linear relationship between model size and performance. (A) Accuracy vs Parameter Count: Pretrained models (0.9564--0.9796) consistently outperform scratch models (0.8804--0.9557) by 5--9 percentage points regardless of size. Performance saturates at 28--30M parameters (approximately 0.98 accuracy); larger models (86.6M ViT, 303M RETFound) provide minimal gains, demonstrating diminishing returns. The vertical separation between green/blue clusters visualizes the 5.18% mean pretraining advantage. No significant correlation between size and accuracy ($p = 0.65$) confirms that larger models do not systematically outperform smaller ones. (B) AUROC vs Parameter Count: Similar saturation pattern; pretrained models cluster at 0.9979--0.9991, while scratch models range 0.9876--0.9968. No significant correlation ($p = 0.47$). (C) F1-Score vs Parameter Count: Mirrors accuracy patterns; pretrained 0.9565--0.9797, scratch 0.8802--0.9558. No significant correlation ($p = 0.65$). Consistent across panels: compact pretrained models (23--29M) achieve near-optimal performance, while scaling to 300M+ parameters yields negligible benefits.
  • Figure 5: OCT: Multi-panel horizontal bar chart showing parameter efficiency (metric value per 100M parameters) across all 13 model configurations. Each panel shows models on the Y-axis (with parameter counts) sorted by efficiency (bottom: most efficient), with efficiency values [0.0--5.0] on the X-axis. Color coding distinguishes pretrained (blue) from scratch (green) models. (A) Accuracy Efficiency: DinoV2-small leads with 4.204 points/100M params (95.86% accuracy, 22.8M params), only 2.1 percentage points below the best model while using 13$\times$ fewer parameters than RETFound. SwinV2-tiny (3.548, 97.93%) and ConvNeXtV2-tiny (3.425, 97.96%) provide the best efficiency-performance balance. Scratch models show poorer efficiency. (B) AUROC Efficiency: Similar pattern with pretrained models dominating. (C) F1 Efficiency: DinoV2-small again leads (4.203), followed by compact pretrained models (3.4--3.6). (D) Kappa Efficiency: Pretrained models <4.2, scratch <3.4. Consistent across panels: compact pretrained models deliver superior efficiency; large specialized models (RETFound: 0.31--0.32 efficiency) show poor parameter utilization for this OCT classification task.
  • ...and 19 more figures
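
The Pareto-frontier and parameter-efficiency quantities described in the captions of Figures 3 and 5 can be illustrated with a minimal sketch. It is not the authors' code: the function and variable names are hypothetical, and the only inputs are the five OCT models whose parameter counts and accuracies are quoted in the Figure 3 caption. Efficiency is assumed here to be accuracy (as a fraction) divided by parameter count in units of 100M, which appears to match the values quoted for Figure 5.

```python
# Illustrative sketch only. Values are copied from the Figure 3 caption;
# all names below are hypothetical and not taken from the paper's code.

# (model name, parameters in millions, validation accuracy as a fraction)
MODELS = [
    ("DinoV2-small", 22.8, 0.9586),
    ("SwinV2-tiny", 27.6, 0.9793),
    ("ConvNeXtV2-tiny", 28.6, 0.9796),
    ("DinoV2-small-reg", 86.6, 0.9714),
    ("RETFound-MAE-OCT", 303.0, 0.9729),
]


def efficiency(params_m, accuracy):
    """Assumed definition: metric value per 100M parameters."""
    return accuracy / (params_m / 100.0)


def pareto_frontier(models):
    """Return names of models not dominated in (fewer params, higher accuracy).

    Models are scanned from smallest to largest; a model joins the frontier
    only if it is strictly more accurate than every smaller model seen so far.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Sort by size; among equal sizes, consider the more accurate model first.
    for name, params_m, acc in sorted(models, key=lambda m: (m[1], -m[2])):
        if acc > best_accuracy:
            frontier.append(name)
            best_accuracy = acc
    return frontier


if __name__ == "__main__":
    for name, params_m, acc in MODELS:
        print(f"{name}: {efficiency(params_m, acc):.3f} per 100M params")
    print("Pareto frontier:", pareto_frontier(MODELS))
```

Under these assumptions, the frontier consists of DinoV2-small, SwinV2-tiny, and ConvNeXtV2-tiny, and the computed efficiencies round to the 4.204, 3.548, and 3.425 quoted in the Figure 5 caption, with RETFound-MAE-OCT near 0.32.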