Spatial-Mamba: Effective Visual State Space Models via Structure-aware State Fusion

Chaodong Xiao, Minghan Li, Zhengqiang Zhang, Deyu Meng, Lei Zhang

TL;DR

Spatial-Mamba introduces structure-aware state fusion (SASF), which injects local 2D spatial structure into visual state-space models. The model runs in three stages: a unidirectional scan computes the initial states, SASF fuses each state with its spatial neighbors, and the observation equation produces the final output. By expressing Spatial-Mamba, the original Mamba, and linear attention within a shared matrix-multiplication framework, the approach unifies long-range and local context modeling with linear complexity. On ImageNet-1K, COCO, and ADE20K, Spatial-Mamba matches or surpasses state-of-the-art SSM-based models with a single scan, and ablations support the contribution of neighborhood fusion and multi-scale dilation. This work advances visual SSMs by preserving spatial structure more faithfully, and it offers a practical, efficient alternative to multi-branch or dense-attention methods for vision tasks.

Abstract

Selective state space models (SSMs), such as Mamba, excel at capturing long-range dependencies in 1D sequential data, but their application to 2D vision tasks still faces challenges. Current visual SSMs often convert images into 1D sequences and employ various scanning patterns to incorporate local spatial dependencies. However, these methods struggle to capture complex image spatial structures, and their lengthened scanning paths increase the computational cost. To address these limitations, we propose Spatial-Mamba, a novel approach that establishes neighborhood connectivity directly in the state space. Instead of relying solely on sequential state transitions, we introduce a structure-aware state fusion equation, which leverages dilated convolutions to capture image spatial structural dependencies, significantly enhancing the flow of visual contextual information. Spatial-Mamba proceeds in three stages: initial state computation in a unidirectional scan, spatial context acquisition through structure-aware state fusion, and final state computation using the observation equation. Our theoretical analysis shows that Spatial-Mamba unifies the original Mamba and linear attention under the same matrix multiplication framework, providing a deeper understanding of our method. Experimental results demonstrate that Spatial-Mamba, even with a single scan, matches or surpasses state-of-the-art SSM-based models in image classification, detection, and segmentation. Source code and trained models can be found at https://github.com/EdwardChasel/Spatial-Mamba.
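
The three stages named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single channel, a diagonal state transition, zero padding at the image border, and hypothetical names (`scan_states`, `fuse_states`, `spatial_ssm`, `weights`, `dilations`). The 3x3 kernel size and the dilation values passed in are illustrative choices, not values taken from the paper.

```python
import numpy as np

def scan_states(x, A_bar, B_bar):
    """Stage 1: unidirectional scan, h_t = A_bar_t * h_{t-1} + B_bar_t * x_t."""
    L, N = A_bar.shape
    h = np.zeros((L, N))
    prev = np.zeros(N)
    for t in range(L):
        prev = A_bar[t] * prev + B_bar[t] * x[t]
        h[t] = prev
    return h

def fuse_states(h_grid, weights, dilations):
    """Stage 2 (assumed form): weighted sum over dilated 3x3 neighbourhoods.

    h_grid: (H, W, N) states laid back out on the image grid.
    weights: one (3, 3) kernel per dilation; dilations: positive ints.
    """
    H, W, _ = h_grid.shape
    fused = np.zeros_like(h_grid)
    for w, d in zip(weights, dilations):
        # Zero-pad so that neighbours outside the image contribute nothing.
        padded = np.pad(h_grid, ((d, d), (d, d), (0, 0)))
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                r0 = d + di * d
                c0 = d + dj * d
                fused += w[di + 1, dj + 1] * padded[r0:r0 + H, c0:c0 + W]
    return fused

def spatial_ssm(x_grid, A_bar, B_bar, C, weights, dilations):
    """Stages 1 to 3 for one channel: scan, fuse, then observe y_t = C_t . h_t."""
    H, W = x_grid.shape
    h = scan_states(x_grid.reshape(-1), A_bar, B_bar)
    fused = fuse_states(h.reshape(H, W, -1), weights, dilations)
    y = (C * fused.reshape(H * W, -1)).sum(axis=1)
    return y.reshape(H, W)
```

With a single kernel whose only non-zero entry is a 1 at the centre, the fusion step is the identity and the sketch reduces to a plain single-scan SSM, which gives a simple sanity check.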

Paper Structure

This paper contains 20 sections, 8 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Illustration of the scanning patterns of existing visual SSMs (sub-figures (a) to (c)) and our proposed Spatial-Mamba with structure-aware state fusion (sub-figure (d)).
  • Figure 2: Illustrations of the SSM in (a) Mamba and (b) our Spatial-Mamba, where the residual term ${\bm{D}}$ is omitted. In (b), 'Fusion' refers to our proposed structure-aware state fusion (SASF) equation.
  • Figure 3: Visualization of state variables before and after applying the SASF equation. Sub-figures (b) and (c) show the mean of state variables across all channels in the first layer of Spatial-Mamba, while sub-figures (d) and (e) display the state variables for a specific channel in the last layer.
  • Figure 4: Overall network architecture of Spatial-Mamba.
  • Figure 5: Visualizations of matrices ${\bm{M}}$ and the corresponding activation maps for linear attention, Mamba and Spatial-Mamba. The red arrows indicate specific rows in matrices ${\bm{M}}$, along with the corresponding image patches (marked with a red star).
  • ...and 4 more figures
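
The matrices M mentioned in the Figure 5 caption relate to the matrix-multiplication view described in the abstract, in which a model's output is written as y = Mx. The sketch below is an illustration under stated assumptions, not the paper's derivation: it uses a single channel with a diagonal state transition, unnormalised linear attention, and hypothetical function names. It builds M for a plain selective SSM and checks that it matches the recurrent computation.

```python
import numpy as np

def ssm_recurrent(x, A_bar, B_bar, C):
    """Reference: h_t = A_bar_t * h_{t-1} + B_bar_t * x_t, y_t = C_t . h_t."""
    L, N = A_bar.shape
    h = np.zeros(N)
    y = np.zeros(L)
    for t in range(L):
        h = A_bar[t] * h + B_bar[t] * x[t]
        y[t] = np.dot(C[t], h)
    return y

def ssm_matrix(A_bar, B_bar, C):
    """Lower-triangular M with M[i, j] = C_i . (A_bar_i * ... * A_bar_{j+1} * B_bar_j)."""
    L, N = A_bar.shape
    M = np.zeros((L, L))
    for i in range(L):
        decay = np.ones(N)  # empty product when j == i
        for j in range(i, -1, -1):
            M[i, j] = np.sum(C[i] * decay * B_bar[j])
            decay = decay * A_bar[j]
    return M

def linear_attention_matrix(Q, K):
    """Unnormalised, non-causal linear attention: Y = (Q K^T) V, so M = Q K^T."""
    return Q @ K.T
```

In this view the SSM matrix is lower triangular because each output depends only on earlier positions in the scan. Since a fusion step that linearly combines neighbouring states keeps the model linear in x, a fused model also admits such a matrix, though under these assumptions it need not remain lower triangular.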