RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models
Timing Yang, Guoyizhe Wei, Alan Yuille, Feng Wang
TL;DR
The paper studies the representational capacity of Vision Mamba by placing it between Softmax self-attention and Linear Attention, using a rank-based analysis within a unified $Y = \mathbf{M}\mathbf{X}$ framework. It argues that Mamba acts as a low-rank approximation of Softmax Attention that keeps linear scaling, and that its learnable, data-dependent mask yields a higher off-diagonal rank than Linear Attention. The authors introduce a Binary-AUC metric to quantify feature-map discriminability and show that DINO self-supervised pretraining produces clearer activation maps than supervised training, with Mamba reaching 78.5% linear probing accuracy on ImageNet. In the reported experiments, Mamba consistently outperforms Linear Attention on long-range visual tasks and approaches Transformer-level capability, making it a principled, scalable alternative with improved interpretability for vision architectures.
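The unified $Y = \mathbf{M}\mathbf{X}$ view can be made concrete with a small numerical sketch. The code below is an illustration under simplifying assumptions, not the paper's implementation: it uses random single-head `Q` and `K`, omits the feature map and normalization of Linear Attention, and uses a causal cumulative-decay mask as a stand-in for Mamba's learnable, data-dependent mask. All names (`M_softmax`, `M_linear`, `M_masked`, `decay`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 32, 4  # sequence length and head dimension (illustrative values)

Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
X = rng.standard_normal((N, d))  # token features mixed by M


def softmax_rows(S):
    """Row-wise softmax with the usual max-subtraction for stability."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)


# Softmax attention: M = softmax(Q K^T / sqrt(d))
M_softmax = softmax_rows(Q @ K.T / np.sqrt(d))

# Linear attention (assumption: identity feature map, no normalization):
# M = Q K^T, so rank(M) <= d regardless of N.
M_linear = Q @ K.T

# Mamba-like mixing (assumption: causal decay mask L with
# L[i, j] = a_{j+1} * ... * a_i for i >= j and 0 otherwise).
# The gates a_t would normally depend on the input; random values stand in.
decay = 1.0 / (1.0 + np.exp(-rng.standard_normal(N)))  # values in (0, 1)
log_cum = np.cumsum(np.log(decay))
L = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))
M_masked = L * (Q @ K.T)  # element-wise (Hadamard) product

# Every variant mixes tokens the same way: Y = M X.
Y_softmax, Y_linear, Y_masked = M_softmax @ X, M_linear @ X, M_masked @ X

rank_softmax = np.linalg.matrix_rank(M_softmax)
rank_linear = np.linalg.matrix_rank(M_linear)
rank_masked = np.linalg.matrix_rank(M_masked)
print("softmax:", rank_softmax, "linear:", rank_linear, "masked:", rank_masked)
```

In this toy setting the unmasked linear form is capped at rank `d`, while the element-wise mask lifts the rank above that bound at the same parameter cost. The sketch compares only full-matrix rank and does not reproduce the paper's off-diagonal rank analysis.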
Abstract
Mamba has recently garnered attention as an effective backbone for vision tasks. However, its underlying mechanism in visual domains remains poorly understood. In this work, we systematically investigate Mamba's representational properties and make three primary contributions. First, we theoretically analyze Mamba's relationship to Softmax and Linear Attention, confirming that it can be viewed as a low-rank approximation of Softmax Attention and thereby bridging the representational gap between Softmax and Linear forms. Second, we introduce a novel binary segmentation metric for activation map evaluation, extending qualitative assessments to a quantitative measure that demonstrates Mamba's capacity to model long-range dependencies. Third, by leveraging DINO for self-supervised pretraining, we obtain clearer activation maps than those produced by standard supervised approaches, highlighting Mamba's potential for interpretability. Notably, our model also achieves 78.5% linear probing accuracy on ImageNet, underscoring its strong performance. We hope this work can provide valuable insights for future investigations of Mamba-based vision architectures.
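The abstract does not spell out the binary segmentation metric. One plausible reading, assumed here rather than taken from the paper, is to treat each activation value as a score for "foreground" and compute the ROC AUC against a binary ground-truth mask. The sketch below uses the pairwise (Mann-Whitney) form of AUC; the function name `binary_auc` and the toy inputs are hypothetical.

```python
def binary_auc(activation_map, mask):
    """ROC AUC of activation scores against a binary foreground mask.

    Assumption: both inputs are 2D lists of equal shape, and the mask
    holds 1 for foreground and 0 for background. This is an illustrative
    definition, not necessarily the metric used in the paper.
    """
    scores = [a for row in activation_map for a in row]
    labels = [m for row in mask for m in row]
    pos = [s for s, m in zip(scores, labels) if m == 1]
    neg = [s for s, m in zip(scores, labels) if m == 0]
    if not pos or not neg:
        raise ValueError("mask must contain both foreground and background")

    # Fraction of (foreground, background) pairs ranked correctly;
    # ties count as half a correct pair.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: the activations separate foreground from background perfectly.
activation_map = [[0.9, 0.8, 0.1],
                  [0.7, 0.2, 0.3]]
mask = [[1, 1, 0],
        [1, 0, 0]]
print(binary_auc(activation_map, mask))
```

Under this definition a value near 1 means foreground pixels are activated more strongly than background pixels, while 0.5 means the map carries no foreground signal. The pairwise loop is quadratic in the number of pixels, so a rank-based implementation would be preferable for real images.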
