A Separable Self-attention Inspired by the State Space Model for Computer Vision

Juntao Zhang, Shaogeng Liu, Kun Bian, You Zhou, Pei Zhang, Jianning Liu, Jun Zhou, Bingyan Liu

TL;DR

The paper tackles the challenge of achieving global information modeling in vision with linear computational complexity. It introduces Vision Mamba Inspired Separable Self-Attention (VMI-SA), which uses element-wise Q⊙K projections, context vectors, depthwise convolutions, and mask-based rank enhancement to fuse local and global cues without the quadratic costs of softmax attention. By deriving both recurrent and matrix-form realizations, the authors provide flexible pathways for efficient propagation of information, culminating in the demonstrative VMINet backbone. Experimental results on ImageNet-1K, COCO, and ADE20K show competitive accuracy and efficiency relative to ViMs and traditional backbones, supporting the practicality of VMI-SA for diverse vision tasks.
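To make the linear-cost idea concrete, here is a minimal sketch of generic separable self-attention, in which a single context vector replaces the token-by-token attention matrix. It is not the authors' VMI-SA: it omits the Q⊙K projection, the depthwise convolutions, and the mask-based rank enhancement, and it leaves out learned key, value, and output projections. The names `separable_self_attention` and `score_weights` are hypothetical, chosen only for illustration.

```python
import math


def separable_self_attention(tokens, score_weights):
    """Illustrative separable self-attention with identity key/value projections.

    tokens: list of n token vectors, each of length d.
    score_weights: length-d vector (hypothetical) mapping a token to one scalar score.
    Cost is O(n * d); no n-by-n attention matrix is ever formed.
    """
    d = len(tokens[0])

    # One scalar logit per token.
    logits = [sum(t_j * w_j for t_j, w_j in zip(t, score_weights)) for t in tokens]

    # Numerically stable softmax over the n tokens gives the context scores.
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    scores = [e / total for e in exps]

    # Context vector: score-weighted sum of the tokens (global summary).
    context = [sum(s * t[j] for s, t in zip(scores, tokens)) for j in range(d)]

    # Broadcast the context vector onto each ReLU-activated token, element-wise.
    return [[max(t[j], 0.0) * context[j] for j in range(d)] for t in tokens]


if __name__ == "__main__":
    example = [[1.0, 2.0], [3.0, -4.0]]
    print(separable_self_attention(example, [0.0, 0.0]))
```

With all-zero `score_weights` the context scores are uniform, so the context vector is simply the mean of the tokens; every token then receives the same global summary, which is the sense in which global information is propagated at linear cost.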

Abstract

Mamba is an efficient State Space Model (SSM) with linear computational complexity. Although SSMs are not suitable for handling non-causal data, Vision Mamba (ViM) methods still demonstrate good performance in tasks such as image classification and object detection. Recent studies have shown that there is a rich theoretical connection between state space models and attention variants. We propose a novel separable self-attention method, for the first time introducing some excellent design concepts of Mamba into separable self-attention. To ensure a fair comparison with ViMs, we introduce VMINet, a simple yet powerful prototype architecture, constructed solely by stacking our novel attention modules with the most basic down-sampling layers. Notably, VMINet differs significantly from the conventional Transformer architecture. Our experiments demonstrate that VMINet has achieved competitive results on image classification and high-resolution dense prediction tasks. Code is available at: https://github.com/yws-wxs/VMINet.

Paper Structure (19 sections, 13 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Comparison with different modules. To facilitate a clear comparison, we uniformly adapt one-dimensional sequences as input, although this is not necessary for VMI-SA.
  • Figure 2: VMINet architecture overview.
  • Figure 3: Grad-CAM activation maps of the models trained on ImageNet-1K. The visualized images are from the validation set.
  • Figure 4: The VMI-SA after removing attention-related operations. It can be observed that it shares the same overall structure as the ConvNeXt block, but differs in normalization methods and activation functions.