A Separable Self-attention Inspired by the State Space Model for Computer Vision

Juntao Zhang; Shaogeng Liu; Kun Bian; You Zhou; Pei Zhang; Jianning Liu; Jun Zhou; Bingyan Liu

TL;DR

The paper addresses global information modeling in vision at linear computational complexity. It introduces Vision Mamba Inspired Separable Self-Attention (VMI-SA), which combines element-wise Q⊙K projections, context vectors, depthwise convolutions, and mask-based rank enhancement to fuse local and global cues without the quadratic cost of softmax attention. The authors derive both a recurrent form and a matrix form of the module, and stack it with basic down-sampling layers to build the prototype VMINet backbone. Experiments on ImageNet-1K, COCO, and ADE20K show accuracy and efficiency competitive with ViMs and traditional backbones, supporting the practicality of VMI-SA across vision tasks.
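
The context-vector idea mentioned above can be illustrated with a minimal NumPy sketch of generic separable self-attention, in the style popularized by MobileViTv2. This is not the authors' VMI-SA: it omits the element-wise Q⊙K projection, the depthwise convolutions, and the mask-based rank enhancement. The function name, weight names, and shapes are assumptions chosen for illustration only.

```python
import numpy as np


def separable_self_attention(x, w_i, w_k, w_v, w_o):
    """Generic separable self-attention sketch (hypothetical helper).

    x   : (n, d) token sequence
    w_i : (d, 1) projection that produces one context score per token
    w_k, w_v, w_o : (d, d) key, value, and output projections
    """
    # One scalar score per token, normalized with a softmax over the n tokens.
    scores = x @ w_i                                      # (n, 1)
    scores = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)

    # A single d-dimensional context vector replaces the (n, n) attention matrix.
    keys = x @ w_k                                        # (n, d)
    context = (weights * keys).sum(axis=0, keepdims=True) # (1, d)

    # Broadcast the global context onto every token's value, element-wise.
    values = np.maximum(x @ w_v, 0.0)                     # ReLU, (n, d)
    return (values * context) @ w_o                       # (n, d)


# Example: 196 tokens (a 14x14 patch grid) with 32 channels.
rng = np.random.default_rng(0)
n, d = 196, 32
x = rng.standard_normal((n, d))
w_i = rng.standard_normal((d, 1)) / np.sqrt(d)
w_k, w_v, w_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out = separable_self_attention(x, w_i, w_k, w_v, w_o)
print(out.shape)  # (196, 32)
```

Every intermediate array here has at most n × d entries, so time and memory grow linearly with the number of tokens, whereas softmax self-attention materializes an n × n matrix.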

Abstract

Mamba is an efficient State Space Model (SSM) with linear computational complexity. Although SSMs are not suitable for handling non-causal data, Vision Mamba (ViM) methods still demonstrate good performance in tasks such as image classification and object detection. Recent studies have shown that there is a rich theoretical connection between state space models and attention variants. We propose a novel separable self-attention method, for the first time introducing some excellent design concepts of Mamba into separable self-attention. To ensure a fair comparison with ViMs, we introduce VMINet, a simple yet powerful prototype architecture, constructed solely by stacking our novel attention modules with the most basic down-sampling layers. Notably, VMINet differs significantly from the conventional Transformer architecture. Our experiments demonstrate that VMINet has achieved competitive results on image classification and high-resolution dense prediction tasks. Code is available at: https://github.com/yws-wxs/VMINet.

Paper Structure (19 sections, 13 equations, 4 figures, 6 tables)

Preliminaries
Softmax Self-Attention
Separable Self-Attention
Structured State Space Model
Methodology
Element-wise Multiplication Instead of Matrix Multiplication
Context Vector Instead of Attention Matrix
Vision Mamba Inspired Separable Self-Attention
Excellent Design in Mamba
Macro Design
Recurrent Form
Matrix Form
VMINet
Experiments
Image Classification on ImageNet-1K
...and 4 more sections

Figures (4)

Figure 1: Comparison with different modules. To facilitate a clear comparison, we uniformly adapt one-dimensional sequences as input, although this is not necessary for VMI-SA.
Figure 2: VMINet architecture overview.
Figure 3: Grad-CAM activation maps of the models trained on ImageNet-1K. The visualized images are from the validation set.
Figure 4: The VMI-SA after removing attention-related operations. It can be observed that it shares the same overall structure as the ConvNeXt block, but differs in normalization methods and activation functions.
