nnMobileNet++: Towards Efficient Hybrid Networks for Retinal Image Analysis

Xin Li, Wenhui Zhu, Xuanzhao Dong, Hao Wang, Yujian Xiong, Oana Dumitrascu, Yalin Wang

TL;DR

This work tackles efficient retinal image analysis by addressing the limitations of pure CNNs in modeling global context and preserving fine vascular structures. It introduces nnMobileNet++, a four-stage CNN–ViT hybrid that uses Dynamic Snake Convolution (DSC) to maintain the boundary integrity of vessels, stage-specific Transformer modules after the second downsampling, and SimMIM-based self-supervised pretraining on retinal data. Across six public datasets and MICCAI challenges, the model achieves state-of-the-art or competitive accuracy with substantially fewer FLOPs and parameters, with pretraining providing additional gains. Grad-CAM visualizations and ablation studies support the contributions of DSC, ViT integration, and pretraining, highlighting the model's potential for robust retinal analysis on resource-constrained devices and for clinical research applications.
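To make the SimMIM-based pretraining idea concrete, the following is a minimal sketch of masked image modeling in the SimMIM style: a random subset of patches is hidden, and an L1 reconstruction loss is computed only on the hidden patches. The function names, the toy patch data, and the mask ratio of 0.5 are illustrative assumptions, not settings reported for nnMobileNet++.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return a list of 0/1 flags; 1 marks a patch hidden from the encoder."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [1 if i in masked else 0 for i in range(num_patches)]


def masked_l1_loss(pred, target, mask):
    """Mean absolute error over the pixels of masked patches only."""
    total, count = 0.0, 0
    for pred_patch, target_patch, flag in zip(pred, target, mask):
        if flag:
            total += sum(abs(p - t) for p, t in zip(pred_patch, target_patch))
            count += len(target_patch)
    return total / max(count, 1)


# Toy image: 16 patches with 4 pixel values each (hypothetical data).
target = [[float(i + j) for j in range(4)] for i in range(16)]
mask = random_patch_mask(num_patches=16, mask_ratio=0.5)  # assumed ratio

# Stand-in for a network output: an all-zero reconstruction.
pred = [[0.0] * 4 for _ in range(16)]
loss = masked_l1_loss(pred, target, mask)
```

Because visible patches do not contribute to the loss, the encoder is pushed to infer hidden retinal content from its surroundings, which is the intuition behind using this objective for pretraining.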

Abstract

Retinal imaging is a critical, non-invasive modality for the early detection and monitoring of ocular and systemic diseases. Deep learning, particularly convolutional neural networks (CNNs), has driven significant progress in automated retinal analysis, supporting tasks such as fundus image classification, lesion detection, and vessel segmentation. As a representative lightweight network, nnMobileNet has demonstrated strong performance across multiple retinal benchmarks while remaining computationally efficient. However, purely convolutional architectures inherently struggle to capture long-range dependencies and to model the irregular lesions and elongated vascular patterns that characterize retinal images, despite the critical importance of vascular features for reliable clinical diagnosis. To further advance this line of work and extend the original vision of nnMobileNet, we propose nnMobileNet++, a hybrid architecture that progressively bridges convolutional and transformer representations. The framework integrates three key components: (i) dynamic snake convolution for boundary-aware feature extraction, (ii) stage-specific transformer blocks introduced after the second down-sampling stage for global context modeling, and (iii) retinal image pretraining to improve generalization. Experiments on multiple public retinal datasets for classification, together with ablation studies, demonstrate that nnMobileNet++ achieves state-of-the-art or highly competitive accuracy while maintaining low computational cost, underscoring its potential as a lightweight yet effective framework for retinal image analysis.
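The global context modeling attributed to the transformer blocks can be illustrated with generic single-head scaled dot-product self-attention. This is a sketch of the standard mechanism only, not the paper's ViT module: the query, key, and value projections are replaced by the identity for brevity, and the four two-dimensional tokens are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity Q/K/V projections.

    Returns the attended tokens and the attention weight matrix.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for query in tokens:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[c] for w, value in zip(row, tokens)) for c in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Four hypothetical feature tokens, e.g. flattened spatial positions.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = self_attention(tokens)
```

Every row of `weights` assigns a non-zero weight to every token, so each output position depends on the whole feature map in a single layer. A convolution with a small kernel only reaches its local neighborhood, which is the long-range limitation the abstract describes.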

Paper Structure

This paper contains 19 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison of FLOPs, parameter size, and AUC on the MURED dataset. Each bubble corresponds to a model, where the bubble size is proportional to the number of parameters, the x-axis indicates computational cost (FLOPs), and the y-axis shows classification performance (AUC). The nnMobileNet++ (w/o DSC) denotes the architecture after removing the Dynamic Snake Convolution module.
  • Figure 2: (a) Overall architecture of our network, consisting of four stages where early stages are CNN-based and later stages are ViT-based. At the second-stage downsampling, a Dynamic Snake Convolution (DSC) [qi2023dynamic] is employed to better preserve curvilinear structures before feeding features into the ViTs. (b) The CNN blocks are built from nnMobileNet’s Inverted Residual Linear Bottleneck (IRLB), where Depthwise (DW) convolutions capture spatial information and Pointwise (PW) convolutions perform channel mixing. (c) The ViT modules combine local convolutions with global self-attention for contextual representation. "ch" denotes the number of input channels.
  • Figure 3: Examples of image quality in fundus images. The first row shows ungradable images excluded due to blur, poor illumination, or severe artifacts. The second row shows gradable images.
  • Figure 4: Examples of ultra-widefield (UWF) fundus images from the challenge dataset. The diversity of these images introduces challenges such as peripheral distortion, illumination non-uniformity, anatomical variability, and device/domain shifts.
  • Figure 5: Class activation maps (CAMs) [selvaraju2017grad] generated by different networks. The first two rows show Ultra-Widefield Fundus Imaging (UWF) samples from the UWF4DR dataset, and the last two rows show color fundus photography (CFP) samples from the MURED dataset. Compared to existing baselines (nnMobileNet [zhu2024nnmobilenet], EfficientNetV2 [tan2019efficientnet], Swin Transformer [liu2021swin], MobileViT [mehta2021mobilevit], LeViT [graham2021levit], and ViT-Base [dosovitskiy2020image]), our proposed nnMobileNet++ achieves consistently more accurate and focused attention on lesion regions across both UWF and CFP modalities.
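
The split between Depthwise (DW) and Pointwise (PW) convolutions described for Figure 2(b) can be made concrete with a weight-count sketch. It assumes the usual inverted residual layout (1x1 PW expansion, k x k DW convolution, 1x1 linear PW projection) and ignores biases and normalization layers. The channel width of 64, the expansion ratio of 4, and the kernel size of 3 are hypothetical values chosen for illustration, not the paper's configuration.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution without bias."""
    return (c_in // groups) * c_out * k * k


def inverted_residual_params(c_in, c_out, expand, k=3):
    """Weights of PW expansion + DW spatial conv + linear PW projection."""
    hidden = c_in * expand
    pw_expand = conv_params(c_in, hidden, 1)            # channel mixing
    dw = conv_params(hidden, hidden, k, groups=hidden)  # spatial filtering
    pw_project = conv_params(hidden, c_out, 1)          # linear bottleneck
    return pw_expand + dw + pw_project


ch = 64  # hypothetical number of input channels

# Dense 3x3 convolution that mixes space and channels jointly.
dense = conv_params(ch, ch, 3)

# Depthwise 3x3 followed by pointwise 1x1 at the same width.
separable = conv_params(ch, ch, 3, groups=ch) + conv_params(ch, ch, 1)

# Inverted residual block with an assumed expansion ratio of 4.
irlb = inverted_residual_params(ch, ch, expand=4)
```

With these assumed values, the dense convolution has 36,864 weights, the depthwise-plus-pointwise pair has 4,672, and the expanded block has 35,072 while operating at four times the channel width internally. This is the sense in which DW handles spatial information cheaply and PW carries the channel mixing.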