MobileMamba: Lightweight Multi-Receptive Visual Mamba Network

Haoyang He, Jiangning Zhang, Yuxuan Cai, Hongxu Chen, Xiaobin Hu, Zhenye Gan, Yabiao Wang, Chengjie Wang, Yunsheng Wu, Lei Xie

TL;DR

This work proposes the MobileMamba framework, which balances efficiency and performance, and introduces the Multi-Receptive Field Feature Interaction (MRFFI) module, comprising the Long-Range Wavelet Transform-Enhanced Mamba, Efficient Multi-Kernel Depthwise Convolution, and Eliminate Redundant Identity components.

Abstract

Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. CNNs, with their local receptive fields, struggle to capture long-range dependencies, while Transformers, despite their global modeling capabilities, are limited by quadratic computational complexity in high-resolution scenarios. Recently, state-space models have gained popularity in the visual domain due to their linear computational complexity. Despite their low FLOPs, current lightweight Mamba-based models exhibit suboptimal throughput. In this work, we propose the MobileMamba framework, which balances efficiency and performance. We design a three-stage network to significantly improve inference speed. At a fine-grained level, we introduce the Multi-Receptive Field Feature Interaction (MRFFI) module, comprising the Long-Range Wavelet Transform-Enhanced Mamba (WTE-Mamba), Efficient Multi-Kernel Depthwise Convolution (MK-DeConv), and Eliminate Redundant Identity components. This module integrates multi-receptive field information and enhances high-frequency detail extraction. Additionally, we employ training and testing strategies to further improve performance and efficiency. MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods, and is up to 21 times faster than LocalVim on GPU. Extensive experiments on high-resolution downstream tasks demonstrate that MobileMamba surpasses current efficient models, achieving an optimal balance between speed and accuracy.
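To make the quadratic-versus-linear argument concrete, the following back-of-the-envelope sketch compares how token-mixing cost grows with input resolution. It is a rough cost model, not the paper's measurement: it assumes a stride-16 patch embedding, assumes self-attention cost scales as N^2 and a linear-time scan as N in the token count N, and ignores channel width and constant factors. The function names are hypothetical.

```python
def num_tokens(height: int, width: int, stride: int = 16) -> int:
    """Token count after a patch embedding with the given stride (assumed: 16)."""
    return (height // stride) * (width // stride)


def relative_cost(resolution: int, base_resolution: int = 224, stride: int = 16):
    """Return (quadratic_growth, linear_growth) relative to the base resolution.

    Assumption: token mixing costs ~N^2 for self-attention and ~N for a
    linear-time scan; channel width and constant factors are ignored.
    """
    n = num_tokens(resolution, resolution, stride)
    n0 = num_tokens(base_resolution, base_resolution, stride)
    ratio = n / n0
    return ratio ** 2, ratio


if __name__ == "__main__":
    for res in (224, 448, 896):
        quad, lin = relative_cost(res)
        print(f"{res}x{res}: tokens={num_tokens(res, res)}, "
              f"quadratic x{quad:.0f}, linear x{lin:.0f}")
```

Under these assumptions, doubling the side length quadruples the token count, so the quadratic term grows 16-fold while the linear term grows 4-fold. As the abstract notes, low theoretical cost does not by itself guarantee high throughput, so this model says nothing about measured speed.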

Paper Structure

This paper contains 23 sections, 7 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: Top: Visualization of the Effective Receptive Fields (ERF) for different architectures. Bottom: Performance vs. FLOPs with recent CNN/Transformer/Mamba-based methods.
  • Figure 2: Accuracy vs. Speed with Mamba-based methods.
  • Figure 3: Coarse-Grained Design. (A) illustrates the structure of a commonly used four-stage network, where the first two stages can be configured with either (1) a purely CNN-based structure or (2) the MobileMamba structure. (B) depicts the three-stage network structure employed in this study. The following table presents the model parameters for different structures and the ImageNet-1K Top-1 and Top-5 at equivalent throughput.
  • Figure 4: Overview of MobileMamba. (a) Architecture of MobileMamba. (b) 16$\times$16 DownSample PatchEmbed. (c) Structure of MobileMamba Block. (d) Fine-Grained Design. The proposed efficient Multi-Receptive Field Feature Interaction (MRFFI) module.
  • Figure 5: Incremental experiments on ImageNet-1K for MobileMamba, comparing Top-1/Top-5 accuracy, FLOPs, and throughput.
  • ...and 1 more figure
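The MRFFI module shown in Figure 4(d) includes a wavelet-transform-enhanced branch aimed at high-frequency detail. As a rough illustration of what a wavelet transform separates, the sketch below performs a single-level 2D decomposition in plain Python. The choice of the Haar wavelet, the averaging normalization, and the function name `haar_dwt2` are assumptions made for illustration; the summary above does not specify which wavelet or implementation the paper uses.

```python
def haar_dwt2(image):
    """Single-level 2D Haar decomposition of an even-sized 2D list of numbers.

    Returns (approx, horiz, vert, diag). Each output has half the input
    height and width. `approx` is the low-frequency band; the other three
    are high-frequency detail bands. Normalization (divide by 4) is an
    illustrative choice, not taken from the paper.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")
    approx, horiz, vert, diag = [], [], [], []
    for i in range(0, rows, 2):
        a_row, h_row, v_row, d_row = [], [], [], []
        for j in range(0, cols, 2):
            # 2x2 block: a b
            #            c d
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            a_row.append((a + b + c + d) / 4)  # local average
            h_row.append((a + b - c - d) / 4)  # change across rows (horizontal edges)
            v_row.append((a - b + c - d) / 4)  # change across columns (vertical edges)
            d_row.append((a - b - c + d) / 4)  # diagonal change
        approx.append(a_row)
        horiz.append(h_row)
        vert.append(v_row)
        diag.append(d_row)
    return approx, horiz, vert, diag


if __name__ == "__main__":
    edge = [[0, 8], [0, 8]]  # a vertical edge inside one 2x2 block
    print(haar_dwt2(edge))
```

A constant image yields zeros in all three detail bands, while an edge shows up in the band matching its orientation. This is the sense in which the detail bands isolate high-frequency content.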