SF-Mamba: Rethinking State Space Model for Vision

Masakazu Yoshimura, Teruaki Hayashi, Yuki Hoshino, Wei-Yao Wang, Takeshi Ohashi

Abstract

Mamba-based vision models have advanced in recent years as alternatives to Vision Transformers (ViTs), which suffer from quadratic complexity. While the recurrent scanning mechanism of Mamba offers computational efficiency, it inherently limits non-causal interactions between image patches. Prior works have attempted to address this limitation through various multi-scan strategies; however, these approaches suffer from inefficiencies due to suboptimal scan designs and frequent data rearrangement. Moreover, Mamba is relatively slow at the short token lengths commonly used in visual tasks. In pursuit of a truly efficient vision encoder, we rethink the scan operation for vision and the computational efficiency of Mamba. To this end, we propose SF-Mamba, a novel visual Mamba with two key components: auxiliary patch swapping, which encodes bidirectional information flow under a unidirectional scan, and batch folding with periodic state reset, which improves GPU parallelism. Extensive experiments on image classification, object detection, and instance and semantic segmentation consistently demonstrate that our proposed SF-Mamba significantly outperforms state-of-the-art baselines while improving throughput across different model sizes. We will release the source code after publication.

Paper Structure

This paper contains 30 sections, 7 equations, 9 figures, and 14 tables.

Figures (9)

  • Figure 1: Top-1 accuracy and throughput on ImageNet-1K classification. SF-Mamba offers superior accuracy–throughput trade-offs compared to state-of-the-art architectures.
  • Figure 2: Future-to-Past Information Routing via Auxiliary Token Swapping. The left figure illustrates why the commonly used multi-directional scan in visual Mamba fails to achieve high speed, while the right figure presents our proposed solution. We prepend and append learnable auxiliary tokens, $x^{\text{aux}}_{\text{head}}$ and $x^{\text{aux}}_{\text{tail}}$, to the patch sequence. Within each MambaVision block, the causal selective scan aggregates sequence-wide context into the tail token $y^{\text{aux}}_{\text{tail}}$. A lightweight, parameter-free Swap operation then moves this global summary to the sequence head, yielding $\tilde{X}$ for the next layer, so that all patch states are conditioned on global context. The swap incurs negligible computational overhead while enabling effective global-context propagation across layers.
  • Figure 3: Batch folding with periodic state reset. (Left) An input tensor of shape $[B, D, T]$ is reshaped into $[B_1, D, (B_2 \cdot T)]$, concatenating $B_2$ short sequences into a longer one. On its own, this reshaping mixes hidden states across batch elements. (Right) To avoid information leakage, we reset the recurrence every $T$ steps. Since $h_t \gets A_t h_{t-1} + B_t x_t$, setting $A_t=0$ at sequence boundaries is equivalent to re-initializing the hidden state. In contrast, $B_t$ (input projection) and $C_t$ (output projection) operate locally and therefore remain unchanged.
  • Figure 4: Speedup of the SSM computation as $B_1$ varies. The four configurations of [batch size, dimension, state dimension, sequence length] are the exact settings for ours-T stage 3, ours-T stage 4, ours-B stage 3, and ours-B stage 4.
  • Figure 5: Throughput–accuracy trade-off on ADE20K. The x-axis denotes frames per second at batch size 1 (higher is better), and the y-axis shows mIoU (higher is better). SF-Mamba variants lie on the Pareto front.
  • ...and 4 more figures
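
The batch folding with periodic state reset described in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation (the source code has not been released): it assumes a simplified recurrence with a scalar decay per channel, omits the state dimension and the output projection $C_t$, and uses a plain Python loop in place of a fused GPU scan. The function names `scan` and `folded_scan` and the argument `Bx` (standing for the product $B_t x_t$) are hypothetical.

```python
import numpy as np

def scan(A, Bx):
    """Reference recurrence h_t = A_t * h_{t-1} + Bx_t along the last axis, with h_0 = 0."""
    h = np.zeros(A.shape[:-1])
    out = np.empty_like(Bx)
    for t in range(A.shape[-1]):
        h = A[..., t] * h + Bx[..., t]
        out[..., t] = h
    return out

def folded_scan(A, Bx, B2):
    """Fold B2 sequences of length T into one of length B2*T and reset the state every T steps."""
    B, D, T = A.shape
    assert B % B2 == 0, "batch size must be divisible by the fold factor B2"
    B1 = B // B2

    def fold(x):
        # [B, D, T] -> [B1, B2, D, T] -> [B1, D, B2, T] -> [B1, D, B2*T]
        return x.reshape(B1, B2, D, T).transpose(0, 2, 1, 3).reshape(B1, D, B2 * T)

    A_f = fold(A).copy()   # copy so the caller's array is never modified
    Bx_f = fold(Bx)
    # Periodic state reset: A_t = 0 at the first step of every folded sequence,
    # so h_{t-1} from the previous sequence is discarded.
    A_f[..., ::T] = 0.0
    out = scan(A_f, Bx_f)
    # Unfold back to the original [B, D, T] layout.
    return out.reshape(B1, D, B2, T).transpose(0, 2, 1, 3).reshape(B, D, T)

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 0.9, size=(4, 3, 5))    # hypothetical sizes: B=4, D=3, T=5
Bx = rng.normal(size=(4, 3, 5))
print(np.allclose(folded_scan(A, Bx, B2=2), scan(A, Bx)))
```

Under these assumptions the folded scan should reproduce the unfolded result, since zeroing $A_t$ at each boundary removes the contribution of the preceding sequence while leaving the $B_t x_t$ term untouched. Commenting out the reset line lets state leak from one folded sequence into the next, which is the mixing the caption warns about.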