RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models

Timing Yang, Guoyizhe Wei, Alan Yuille, Feng Wang

TL;DR

The paper examines Vision Mamba’s representational capacity by placing it between Softmax self-attention and Linear Attention through a rank-based analysis of a unified $Y = \mathbf{M}\mathbf{X}$ framework. It shows that Mamba can be viewed as a low-rank approximation of Softmax Attention that keeps linear scalability: its learnable, data-dependent mask gives $\mathbf{M}$ a higher off-diagonal rank than Linear Attention. The authors introduce a Binary-AUC metric that scores how well activation maps separate foreground from background, and show that DINO self-supervised pretraining produces clearer activation maps than supervised training, with Mamba reaching 78.5% linear probing accuracy on ImageNet. In the reported experiments, Mamba outperforms Linear Attention on long-range visual tasks and approaches Transformer-level capability, offering a scalable and more interpretable alternative for vision architectures.
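To make the $Y = \mathbf{M}\mathbf{X}$ view concrete, here is a minimal sketch in plain Python. It builds a lower-triangular $\mathbf{M}$ from query-key scores multiplied element-wise by a decay mask, then checks that $\mathbf{M}\mathbf{X}$ equals the output of the equivalent recurrent scan. The scalar per-step decay `decay[t]`, the function names, and the toy values are assumptions made for illustration; the paper's exact parameterization of the mask is not reproduced here. Setting every decay to 1 recovers causal Linear Attention.

```python
from fractions import Fraction


def matmul(A, B):
    """Plain-Python matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def mixing_matrix(Q, K, decay):
    """M[i][j] = (q_i . k_j) * decay[j+1] * ... * decay[i] for j <= i, else 0."""
    n = len(Q)
    M = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            mask = Fraction(1)
            for t in range(j + 1, i + 1):
                mask *= decay[t]  # data-dependent mask entry (assumed scalar decay)
            score = sum(q * k for q, k in zip(Q[i], K[j]))
            M[i][j] = score * mask
    return M


def recurrent_scan(Q, K, X, decay):
    """Same map as an RNN: S_t = a_t * S_{t-1} + k_t x_t^T, then y_t = q_t^T S_t."""
    d, c = len(K[0]), len(X[0])
    S = [[Fraction(0)] * c for _ in range(d)]
    Y = []
    for q, k, x, a in zip(Q, K, X, decay):
        S = [[a * S[r][s] + k[r] * x[s] for s in range(c)] for r in range(d)]
        Y.append([sum(q[r] * S[r][s] for r in range(d)) for s in range(c)])
    return Y


# Toy inputs (hypothetical): 4 tokens, 2-dimensional queries/keys, 2 value channels.
Q = [[1, 0], [2, 1], [0, 3], [1, 1]]
K = [[1, 2], [0, 1], [1, 1], [2, 0]]
X = [[1, 0], [0, 1], [2, 1], [1, 3]]
ones = [Fraction(1)] * 4  # no decay: causal Linear Attention
decay = [Fraction(1), Fraction(1, 2), Fraction(3, 4), Fraction(1, 3)]

for a in (ones, decay):
    M = mixing_matrix(Q, K, a)
    # The matrix form and the recurrent form agree exactly.
    assert matmul(M, X) == recurrent_scan(Q, K, X, a)
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the two forms can be compared with `==` rather than a floating-point tolerance.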

Abstract

Mamba has recently garnered attention as an effective backbone for vision tasks. However, its underlying mechanism in visual domains remains poorly understood. In this work, we systematically investigate Mamba's representational properties and make three primary contributions. First, we theoretically analyze Mamba's relationship to Softmax and Linear Attention, confirming that it can be viewed as a low-rank approximation of Softmax Attention and thereby bridging the representational gap between Softmax and Linear forms. Second, we introduce a novel binary segmentation metric for activation map evaluation, extending qualitative assessments to a quantitative measure that demonstrates Mamba's capacity to model long-range dependencies. Third, by leveraging DINO for self-supervised pretraining, we obtain clearer activation maps than those produced by standard supervised approaches, highlighting Mamba's potential for interpretability. Notably, our model also achieves a 78.5 percent linear probing accuracy on ImageNet, underscoring its strong performance. We hope this work can provide valuable insights for future investigations of Mamba-based vision architectures.

Paper Structure

This paper contains 33 sections, 2 theorems, 33 equations, 13 figures, and 6 tables.

Key Result

Lemma 1

For two matrices $\mathbf{A}$ and $\mathbf{B}$ of the same dimensions, the Hadamard product $\mathbf{A} \circ \mathbf{B}$ (defined element-wise as $(\mathbf{A} \circ \mathbf{B})_{i,j} = \mathbf{A}_{i,j}\mathbf{B}_{i,j}$) satisfies:
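The displayed inequality did not survive in this listing. Assuming the lemma states the classical Hadamard rank bound, $\operatorname{rank}(\mathbf{A} \circ \mathbf{B}) \le \operatorname{rank}(\mathbf{A})\,\operatorname{rank}(\mathbf{B})$, the sketch below checks it on small integer matrices with exact arithmetic. The helper names and the example matrices are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def rank(rows):
    """Exact matrix rank via Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, n_rows):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == n_rows:
            break
    return r


def hadamard(A, B):
    """Element-wise product: (A o B)[i][j] = A[i][j] * B[i][j]."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


A = [[1, 2, 3], [2, 4, 6], [3, 6, 9]]      # rank 1 (outer product)
B = [[2, 1, 0], [3, 1, -1], [4, 1, -2]]    # rank 2

for left, right in ((A, B), (B, B)):
    product_rank = rank(hadamard(left, right))
    bound = rank(left) * rank(right)
    assert product_rank <= bound
    print(product_rank, "<=", bound)
```

The second pair is the more interesting one: the element-wise product of a rank-2 matrix with itself has full rank 3, so a Hadamard product can raise rank, but never beyond the product of the two ranks.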

Figures (13)

  • Figure 1: Activation maps of Self-Attention [attention], Mamba [vim], and Linear Attention [linearattn]. As shown, Self-Attention typically produces high-quality activations with clear foreground-background distinction, while Mamba shows similar patterns but noisier background activations. Linear Attention, in contrast, often struggles to clearly focus on the informative parts of the images.
  • Figure 2: Unified formulation of the lower triangular matrix $\mathbf{M}$ with diagonal and off-diagonal block elements. This example is for a sequence length of nine and a chunk size of three.
  • Figure 3: Visualization metric: The process involves encoding (Enc.) an image, resizing (Res.) the segmentation mask to align with the feature map, and calculating the AUC scores.
  • Figure 4: AUC analysis across different settings (values in %): (a) Supervised vs. self-supervised learning and model sizes. (b) Attention head contribution in self/linear attention. (c) Register mechanisms in DINOv2. (d) Register token position effects. (e) Feature quality evolution across layers. (f) AUC comparison: self/linear attention, and Mamba (Base model).
  • Figure 5: Comparison of feature map quality in supervised vs. self-supervised settings. Feature maps from Mamba-Reg, Vim, and ViT models trained with supervised learning (left) and DINO self-supervised learning (right) on two example images. Self-supervised learning produces significantly clearer maps with better foreground-background distinction and reduced noise. ViT achieves the cleanest activations, while Mamba-based models show comparable quality in the self-supervised setting.
  • ...and 8 more figures
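
The evaluation pipeline described for Figure 3 (encode the image, resize the segmentation mask to the feature-map resolution, compute AUC) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the activation map is passed in directly in place of an encoder output, nearest-neighbour resizing and the pairwise (Mann-Whitney) form of AUC are assumptions, and the function names `resize_nearest` and `binary_auc` are hypothetical.

```python
def resize_nearest(mask, out_h, out_w):
    """Resize a binary mask to (out_h, out_w) with nearest-neighbour sampling."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def binary_auc(activation, mask):
    """AUC of activations as foreground scores against a same-sized binary mask.

    Computed as the fraction of (foreground, background) pairs in which the
    foreground activation is higher; ties count as half.
    """
    scores = [v for row in activation for v in row]
    labels = [v for row in mask for v in row]
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("mask must contain both foreground and background")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy 4x4 segmentation mask with the object in the top-left quadrant.
seg_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
small_mask = resize_nearest(seg_mask, 2, 2)

# Two hypothetical 2x2 activation maps: one focused on the object, one not.
focused = [[0.9, 0.2], [0.1, 0.3]]
unfocused = [[0.2, 0.9], [0.1, 0.3]]

print(binary_auc(focused, small_mask), binary_auc(unfocused, small_mask))
```

A score near 1 means foreground patches receive consistently higher activations than background patches, while 0.5 corresponds to a map that does not separate them at all.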

Theorems & Definitions (4)

  • Lemma 1: Hadamard Product Rank Bound
  • proof
  • Lemma 2: Rank Bound for Matrix Products
  • proof