
EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba

Xiaohuan Pei, Tao Huang, Chang Xu

TL;DR

EfficientVMamba addresses the efficiency gap in vision models by combining an atrous, skip-sampled scan mechanism (ES2D) with a dual-path EVSS block that fuses global state-space processing with a local convolution branch. The architecture employs inverted insertion, placing global-capable EVSS blocks in the early high-resolution stages and light-weight InRes blocks in the later ones, to balance global coherence with local detail. Across ImageNet, COCO, and ADE20K, EfficientVMamba variants achieve substantially reduced FLOPs while delivering competitive or superior accuracy compared to lightweight CNN/ViT baselines, with notable gains for small models under tight resource budgets. Ablations confirm the benefits of ES2D, SE-based fusion, and early-stage global blocks, underscoring the practical value of integrating SSM-based global reasoning with efficient convolutions for scalable, edge-friendly vision models.

Abstract

Prior efforts in light-weight model development have mainly centered on CNN- and Transformer-based designs, yet both face persistent challenges. CNNs, adept at local feature extraction, compromise resolution, while Transformers offer global reach but escalate computational demands to $\mathcal{O}(N^2)$. This ongoing trade-off between accuracy and efficiency remains a significant hurdle. Recently, state space models (SSMs) such as Mamba have shown outstanding performance and competitiveness in tasks such as language modeling and computer vision, while reducing the time complexity of global information extraction to $\mathcal{O}(N)$. Inspired by this, this work explores the potential of visual state space models in light-weight model design and introduces a novel efficient model variant dubbed EfficientVMamba. Concretely, EfficientVMamba integrates an atrous-based selective scan approach using efficient skip sampling, constituting building blocks designed to harness both global and local representational features. Additionally, we investigate the integration between SSM blocks and convolutions, and introduce an efficient visual state space block combined with an additional convolution branch, which further elevates model performance. Experimental results show that EfficientVMamba scales down computational complexity while yielding competitive results across a variety of vision tasks. For example, EfficientVMamba-S with $1.3$G FLOPs improves on Vim-Ti with $1.5$G FLOPs by a large margin of $5.6\%$ accuracy on ImageNet. Code is available at: https://github.com/TerryPei/EfficientVMamba

Paper Structure (18 sections, 8 equations, 3 figures, 8 tables)

This paper contains 18 sections, 8 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Lightweight Model Performance Comparison on ImageNet. EfficientVMamba outperforms previous work across various model variants in terms of both accuracy and computational complexity.
  • Figure 2: Illustration of the efficient 2D scan method (ES2D). (a) VMamba (liu2024vmamba) employs the SS2D method in vision tasks, traversing entire row or column axes, which is computationally heavy. (b) We present an efficient 2D scanning method, ES2D, which groups patches by skip sampling and then traverses each group (with a skipping step of 2 in the figure). The proposed scan approach reduces computational demands ($4N \rightarrow N$) while preserving global feature maps (e.g., each group contains eye-related patches).
  • Figure 3: Architecture overview of EfficientVMamba. We highlight our contributions with corresponding colors in the figure. (1) ES2D: an atrous-based selective scanning strategy via skip sampling and regrouping in the spatial space. (2) EVSS: the EVSS block merges global and local feature extraction, using modified ES2D and convolutional branches enhanced by Squeeze-Excitation blocks for a refined dual-pathway feature representation. (3) Inverted Fusion: Inverted Fusion places local-representation modules in deep layers, deviating from traditional designs by utilizing EVSS blocks early for global representation and inverted residual blocks later for local feature extraction.
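
The skip sampling and regrouping described for ES2D in Figure 2(b) can be illustrated with a small sketch. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the function names `es2d_groups` and `es2d_merge`, the toy 4×4 grid of patch ids, and the reading of $4N$ as four full-length traversals are all hypothetical choices made here for illustration. The selective scan itself (the SSM applied to each sequence) is omitted; only the patch grouping and the resulting sequence lengths are shown.

```python
def es2d_groups(grid, step=2):
    """Split an H x W grid of patches into step*step skip-sampled groups.

    Group (i, j) keeps the patches at rows i, i+step, ... and columns
    j, j+step, ..., flattened row by row into one scan sequence.
    (Hypothetical helper, not taken from the paper's code.)
    """
    groups = {}
    for i in range(step):
        for j in range(step):
            sub = [row[j::step] for row in grid[i::step]]
            groups[(i, j)] = [patch for row in sub for patch in row]
    return groups


def es2d_merge(groups, height, width, step=2):
    """Inverse of es2d_groups: scatter each group's sequence back onto the grid."""
    grid = [[None] * width for _ in range(height)]
    for (i, j), seq in groups.items():
        it = iter(seq)
        for r in range(i, height, step):
            for c in range(j, width, step):
                grid[r][c] = next(it)
    return grid


H, W = 4, 4
grid = [[r * W + c for c in range(W)] for r in range(H)]  # patch ids 0..15
groups = es2d_groups(grid, step=2)

for offset, seq in groups.items():
    print(offset, seq)
# (0, 0) [0, 2, 8, 10]
# (0, 1) [1, 3, 9, 11]
# (1, 0) [4, 6, 12, 14]
# (1, 1) [5, 7, 13, 15]

# Assumption: SS2D runs four traversals over all N patches (the 4N in the caption),
# while the skip-sampled variant scans four groups of N/4 patches each.
scanned_ss2d = 4 * H * W
scanned_es2d = sum(len(seq) for seq in groups.values())
print(scanned_ss2d, scanned_es2d)  # 64 16
```

Each group spans the whole image at a coarser stride, which is why a scan over a single group still sees global context even though it touches only a quarter of the patches. Merging the groups back with `es2d_merge` restores the original spatial layout.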