LocalMamba: Visual State Space Model with Windowed Selective Scan

Tao Huang, Xiaohuan Pei, Shan You, Fei Wang, Chen Qian, Chang Xu

TL;DR

LocalMamba introduces windowed local scanning for Vision State Space Models to preserve 2D spatial locality while maintaining global context. It couples a four-branch local scan with a differentiable, layer-wise search over scan directions (inspired by DARTS) to adapt scanning to each layer. The approach yields consistent improvements over Vim and VMamba across ImageNet, COCO, and ADE20K while preserving comparable FLOPs, demonstrated through extensive experiments and ablations. This work highlights the importance of scan strategy in visual SSMs and offers a practical path toward more accurate and efficient vision backbones.
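
The DARTS-style idea mentioned above can be illustrated with a small sketch of the continuous relaxation: every candidate scan direction gets a learnable logit, the layer output is the softmax-weighted sum of all candidate branches, and after the search the highest-scoring directions are kept. This is only the forward pass in NumPy under stated assumptions, not the authors' implementation. The candidate names, the number of candidates, and the helper names `mix_candidates` and `select_directions` are hypothetical; only the four-branch design comes from the summary.

```python
import numpy as np

# Hypothetical candidate scan directions, for illustration only.
CANDIDATES = [
    "horizontal", "vertical",
    "horizontal_flipped", "vertical_flipped",
    "local_small_window", "local_large_window",
]

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def mix_candidates(branch_outputs, logits):
    """Weighted sum of all candidate branch outputs (continuous relaxation).

    branch_outputs: array of shape (num_candidates, tokens, channels)
    logits:         array of shape (num_candidates,), one score per candidate
    """
    weights = softmax(logits)
    # Contract the candidate axis: result has shape (tokens, channels).
    return np.tensordot(weights, branch_outputs, axes=1)

def select_directions(logits, names, keep=4):
    """After the search, keep the `keep` candidates with the largest logits."""
    top = np.argsort(logits)[::-1][:keep]
    return [names[i] for i in top]

# Toy data: 6 candidate branches, 16 tokens, 8 channels.
rng = np.random.default_rng(0)
outputs = rng.normal(size=(len(CANDIDATES), 16, 8))
layer_logits = np.array([0.1, 1.2, -0.3, 0.8, 2.0, 0.5])  # made-up values

mixed = mix_candidates(outputs, layer_logits)
print(mixed.shape)
print(select_directions(layer_logits, CANDIDATES))
```

In an actual search the logits would be trained by gradient descent together with the network weights, with a separate set of logits for each layer so that each layer can settle on its own directions.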

Abstract

Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba.
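
To make the locality argument concrete, the following sketch compares a plain row-by-row flattening with a windowed scan on a small token grid and measures how many scan steps separate two vertically adjacent tokens. It is a minimal illustration, assuming the grid size is a multiple of the window size and a row-major traversal inside each window; the function names are hypothetical and not taken from the LocalMamba code base.

```python
import numpy as np

def row_major_order(height, width):
    """Token indices in plain row-by-row flattening order."""
    return np.arange(height * width)

def local_scan_order(height, width, window):
    """Token indices visited window by window, row-major inside each window.

    Assumes height and width are multiples of `window` (no padding handled).
    """
    idx = np.arange(height * width).reshape(height, width)
    # Axes: (window row, row in window, window col, col in window).
    blocks = idx.reshape(height // window, window, width // window, window)
    # Reorder to (window row, window col, row in window, col in window).
    return blocks.transpose(0, 2, 1, 3).reshape(-1)

def scan_distance(order, token_a, token_b):
    """Number of scan steps separating two tokens in a given scan order."""
    position = np.empty_like(order)
    position[order] = np.arange(order.size)
    return int(abs(position[token_a] - position[token_b]))

H = W = 6
flat = row_major_order(H, W)
local = local_scan_order(H, W, window=3)

# Tokens 0 and 6 are vertical neighbours at grid positions (0, 0) and (1, 0).
print(scan_distance(flat, 0, 6))   # 6: a full row lies between them
print(scan_distance(local, 0, 6))  # 3: only one window row lies between them
print(local[:9].tolist())          # first 3x3 window: [0, 1, 2, 6, 7, 8, 12, 13, 14]
```

With row-by-row flattening the gap between vertical neighbours grows with the image width, whereas in the windowed order it is bounded by the window width for tokens inside the same window.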

Paper Structure (22 sections, 5 equations, 5 figures, 6 tables)

This paper contains 22 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustration of scan methods. (a) and (b): Previous methods Vim [zhu2024vision] and VMamba [liu2024vmamba] traverse the entire row or column axis, resulting in significant distances for capturing dependencies between neighboring pixels within the same semantic region (e.g., the left eye in the image). (c) We introduce a novel scan method that partitions tokens into distinct windows, facilitating traversal within each window (window size is $3\times 3$ here). This approach enhances the ability to capture local dependencies.
  • Figure 2: By extending the original scan with our local scan mechanism, our method significantly improves the ImageNet accuracies of Vim [zhu2024vision] while keeping similar FLOPs.
  • Figure 3: (a) Structure of the LocalVim model. (b) Illustration of the proposed spatial and channel attention module (SCAttn).
  • Figure 4: Visualization of the searched directions of our models. The visualization of LocalVMamba-S is shown separately in Figure 5.
  • Figure 5: Visualization of the searched directions of LocalVMamba-S.