A Separable Self-attention Inspired by the State Space Model for Computer Vision
Juntao Zhang, Shaogeng Liu, Kun Bian, You Zhou, Pei Zhang, Jianning Liu, Jun Zhou, Bingyan Liu
TL;DR
The paper tackles the challenge of achieving global information modeling in vision with linear computational complexity. It introduces Vision Mamba Inspired Separable Self-Attention (VMI-SA), which combines element-wise Q⊙K projections, context vectors, depthwise convolutions, and mask-based rank enhancement to fuse local and global cues without the quadratic cost of softmax attention. The authors derive both recurrent and matrix-form realizations of the method and use it to build VMINet, a prototype backbone. Experiments on ImageNet-1K, COCO, and ADE20K show accuracy and efficiency competitive with ViMs and traditional backbones, supporting the practicality of VMI-SA across vision tasks.
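To make the linear-cost idea concrete, the following is a minimal NumPy sketch of generic separable self-attention, in which each token is scored once and all tokens share a single context vector. It is not the authors' VMI-SA: the Q⊙K projection, depthwise convolutions, and mask-based rank enhancement are omitted, and the function name, weight names (`w_i`, `w_k`, `w_v`, `w_o`), and sizes are assumptions chosen for illustration.

```python
import numpy as np

def separable_self_attention(x, w_i, w_k, w_v, w_o):
    """Generic separable self-attention (illustrative, not VMI-SA).

    x: (N, d) token matrix. Returns an (N, d) matrix.
    No N x N attention map is formed, so cost grows linearly with N.
    """
    # One scalar score per token, normalized with a softmax over the N tokens.
    scores = x @ w_i                                        # (N, 1)
    scores = scores - scores.max(axis=0, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)  # (N, 1), sums to 1

    # Global context vector: score-weighted sum of the key projections.
    keys = x @ w_k                                          # (N, d)
    context = (weights * keys).sum(axis=0, keepdims=True)   # (1, d)

    # Broadcast the shared context onto every token's value projection.
    values = np.maximum(x @ w_v, 0.0)                       # (N, d), ReLU
    return (values * context) @ w_o                         # (N, d)

# Hypothetical sizes: 196 tokens (a 14 x 14 patch grid), 64 channels.
rng = np.random.default_rng(0)
N, d = 196, 64
x = rng.standard_normal((N, d))
w_i = rng.standard_normal((d, 1)) / np.sqrt(d)
w_k = rng.standard_normal((d, d)) / np.sqrt(d)
w_v = rng.standard_normal((d, d)) / np.sqrt(d)
w_o = rng.standard_normal((d, d)) / np.sqrt(d)

out = separable_self_attention(x, w_i, w_k, w_v, w_o)
print(out.shape)
```

Because the only interaction between tokens passes through the single `context` vector, doubling `N` roughly doubles the work, whereas softmax attention would quadruple it.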
Abstract
Mamba is an efficient State Space Model (SSM) with linear computational complexity. Although SSMs are not well suited to non-causal data, Vision Mamba (ViM) methods still perform well in tasks such as image classification and object detection. Recent studies have shown that there is a rich theoretical connection between state space models and attention variants. We propose a novel separable self-attention method that, for the first time, introduces several of Mamba's design concepts into separable self-attention. To ensure a fair comparison with ViMs, we introduce VMINet, a simple yet powerful prototype architecture constructed solely by stacking our novel attention modules with the most basic down-sampling layers. Notably, VMINet differs significantly from the conventional Transformer architecture. Our experiments demonstrate that VMINet achieves competitive results on image classification and high-resolution dense prediction tasks. Code is available at: https://github.com/yws-wxs/VMINet.
