nnMobileNet++: Towards Efficient Hybrid Networks for Retinal Image Analysis
Xin Li, Wenhui Zhu, Xuanzhao Dong, Hao Wang, Yujian Xiong, Oana Dumitrascu, Yalin Wang
TL;DR
This work tackles efficient retinal image analysis by addressing the limitations of pure CNNs in modeling global context and preserving fine vascular structures. It introduces nnMobileNet++, a four-stage CNN–ViT hybrid that combines Dynamic Snake Convolution (DSC) to maintain the boundary integrity of vessels, stage-specific Transformer modules after the second downsampling, and SimMIM-based self-supervised pretraining on retinal data. Across six public datasets and MICCAI challenges, the model achieves state-of-the-art or competitive accuracy with substantially fewer FLOPs and parameters, with pretraining providing additional gains. Grad-CAM visualizations and ablation studies confirm the contributions of DSC, ViT integration, and pretraining, highlighting the model's potential for robust retinal analysis on resource-constrained devices and for clinical research applications.
Abstract
Retinal imaging is a critical, non-invasive modality for the early detection and monitoring of ocular and systemic diseases. Deep learning, particularly convolutional neural networks (CNNs), has driven significant progress in automated retinal analysis, supporting tasks such as fundus image classification, lesion detection, and vessel segmentation. As a representative lightweight network, nnMobileNet has demonstrated strong performance across multiple retinal benchmarks while remaining computationally efficient. However, purely convolutional architectures inherently struggle to capture long-range dependencies and to model the irregular lesions and elongated vascular patterns that characterize retinal images, despite the critical importance of vascular features for reliable clinical diagnosis. To further advance this line of work and extend the original vision of nnMobileNet, we propose nnMobileNet++, a hybrid architecture that progressively bridges convolutional and transformer representations. The framework integrates three key components: (i) dynamic snake convolution for boundary-aware feature extraction, (ii) stage-specific transformer blocks introduced after the second down-sampling stage for global context modeling, and (iii) retinal image pretraining to improve generalization. Experiments on multiple public retinal datasets for classification, together with ablation studies, demonstrate that nnMobileNet++ achieves state-of-the-art or highly competitive accuracy while maintaining low computational cost, underscoring its potential as a lightweight yet effective framework for retinal image analysis.
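To make the pretraining component more concrete, the following is a minimal sketch of the general SimMIM idea named in the TL;DR: randomly mask a subset of image patches and penalize reconstruction error with an L1 loss on the masked patches only. It is not the authors' implementation. The function names, the 7x7 patch grid, the 0.6 mask ratio, and the simplification of one scalar value per patch are all assumptions chosen for illustration.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio, seed=None):
    """Return a grid_h x grid_w grid of 0/1 flags; 1 marks a masked patch.

    Hypothetical helper: the grid size and mask ratio are illustrative,
    not values taken from the paper.
    """
    rng = random.Random(seed)
    n_patches = grid_h * grid_w
    n_masked = round(n_patches * mask_ratio)
    masked = set(rng.sample(range(n_patches), n_masked))
    return [
        [1 if r * grid_w + c in masked else 0 for c in range(grid_w)]
        for r in range(grid_h)
    ]


def masked_l1_loss(pred, target, mask):
    """Mean absolute error computed over masked patches only.

    Each patch is reduced to a single scalar here (an assumption made to
    keep the sketch dependency-free); a real model would compare pixels.
    """
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0


# Example: mask 60% of a 7x7 patch grid and score a dummy prediction.
mask = random_patch_mask(7, 7, 0.6, seed=0)
target = [[0.0] * 7 for _ in range(7)]
pred = [[1.0] * 7 for _ in range(7)]
loss = masked_l1_loss(pred, target, mask)
```

Because the loss ignores visible patches, the encoder is rewarded only for inferring hidden content from its surroundings, which is the property that makes this style of pretraining plausible for thin, spatially extended structures such as vessels.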
