Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model

Yuheng Shi, Minjing Dong, Chang Xu

TL;DR

This work tackles long-range dependency modeling under parameter constraints in vision backbones by introducing Multi-Scale VMamba (MSVMamba). It replaces the single-scale SSM scanning with Multi-Scale 2D (MS2D) scanning and adds a ConvFFN to improve channel mixing, forming the MS3 block (MSVSS+ConvFFN). The approach reduces redundancy and computational cost while preserving global receptive fields, achieving 82.8% top-1 on ImageNet for MSVMamba-Tiny and state-of-the-art results on COCO and ADE20K among SSM-based models. The results demonstrate an effective balance between efficiency and accuracy for vision tasks like classification, detection, and segmentation.

Abstract

Despite the significant achievements of Vision Transformers (ViTs) in various vision tasks, they are constrained by quadratic complexity. Recently, State Space Models (SSMs) have garnered widespread attention due to their global receptive field and linear complexity with respect to the input length, demonstrating substantial potential across fields including natural language processing and computer vision. To improve the performance of SSMs in vision tasks, a multi-scan strategy is widely adopted, which leads to significant redundancy in SSMs. For a better trade-off between efficiency and performance, we analyze the underlying reasons behind the success of the multi-scan strategy, where long-range dependency plays an important role. Based on the analysis, we introduce Multi-Scale Vision Mamba (MSVMamba) to preserve the superiority of SSMs in vision tasks with limited parameters. It employs a multi-scale 2D scanning technique on both original and downsampled feature maps, which not only benefits long-range dependency learning but also reduces computational costs. Additionally, we integrate a Convolutional Feed-Forward Network (ConvFFN) to address the lack of channel mixing. Our experiments demonstrate that MSVMamba is highly competitive, with the MSVMamba-Tiny model achieving 82.8% top-1 accuracy on ImageNet, 46.9% box mAP and 42.2% instance mAP on COCO with the Mask R-CNN framework and a 1x training schedule, and 47.6% mIoU with single-scale testing on ADE20K. Code is available at https://github.com/YuHengsss/MSVMamba.
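
To see why scanning downsampled feature maps can lower cost, the following minimal sketch counts how many tokens the scanning routes process in total. It is an illustration under stated assumptions, not the authors' implementation: the function name scan_tokens, the 56x56 feature map, the stride of 2, and the split of one full-resolution route plus three downsampled routes are all chosen here for the example, and the abstract itself does not specify them. Token count is only a rough proxy for SSM cost, which is linear in sequence length.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, used for the downsampled spatial size."""
    return -(-a // b)


def scan_tokens(height: int, width: int,
                full_res_routes: int, downsampled_routes: int,
                stride: int) -> int:
    """Total tokens processed by all scanning routes (hypothetical helper).

    Each full-resolution route visits every position of the feature map once;
    each downsampled route visits every position of a map shrunk by `stride`
    along both spatial axes.
    """
    full = height * width
    down = ceil_div(height, stride) * ceil_div(width, stride)
    return full_res_routes * full + downsampled_routes * down


# Assumed feature-map size for illustration only.
h, w = 56, 56

# Baseline multi-scan: four routes, all at full resolution.
single_scale = scan_tokens(h, w, full_res_routes=4, downsampled_routes=0, stride=1)

# Assumed multi-scale split: one full-resolution route, three at stride 2.
multi_scale = scan_tokens(h, w, full_res_routes=1, downsampled_routes=3, stride=2)

ratio = multi_scale / single_scale
print(single_scale, multi_scale, ratio)
```

Under these assumptions the multi-scale configuration processes 5488 tokens instead of 12544, i.e. 7/16 of the baseline sequence length, while one route still covers the full-resolution map.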

Paper Structure (16 sections, 9 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Comparison on ImageNet.
  • Figure 2: Illustration of decay along horizontal, vertical scanning routes and their ratio.
  • Figure 3: Illustration of the Multi-Scale 2D-Selective-Scan on an image.
  • Figure 4: Illustration of the decay with different scanning routes in SS2D and MS2D.
  • Figure 5: Detailed architecture of the Multi-Scale State Space (MS3) block, consisting of a Multi-Scale Vision State Space (MSVSS) block and a Convolutional Feed-Forward Network (ConvFFN) block.
  • ...and 1 more figures