Demystify Mamba in Vision: A Linear Attention Perspective

Dongchen Han; Ziyi Wang; Zhuofan Xia; Yizeng Han; Yifan Pu; Chunjiang Ge; Jun Song; Shiji Song; Bo Zheng; Gao Huang

TL;DR

The paper reveals a close relationship between Mamba and the linear attention Transformer by reformulating both within a unified framework and identifying six key design differences. It shows that the forget gate and the modified block design largely drive Mamba's superior performance, while the other four distinctions offer limited gains or hinder efficiency. By incorporating these two effective components into linear attention, the authors propose MILA, a Mamba-inspired linear attention model that achieves state-of-the-art results on vision benchmarks with parallelizable computation and faster inference. Empirical results on ImageNet, COCO, and ADE20K show that MILA offers a better accuracy-speed trade-off than existing vision Mamba models, supporting the practical value of the design insights.

Abstract

Mamba is an effective state space model with linear computation complexity. It has recently shown impressive efficiency in dealing with high-resolution inputs across various vision tasks. In this paper, we reveal that the powerful Mamba model shares surprising similarities with the linear attention Transformer, which typically underperforms the conventional Transformer in practice. By exploring the similarities and disparities between the effective Mamba and the subpar linear attention Transformer, we provide comprehensive analyses to demystify the key factors behind Mamba's success. Specifically, we reformulate the selective state space model and linear attention within a unified formulation, rephrasing Mamba as a variant of the linear attention Transformer with six major distinctions: input gate, forget gate, shortcut, no attention normalization, single-head, and modified block design. For each design, we meticulously analyze its pros and cons, and empirically evaluate its impact on model performance in vision tasks. Interestingly, the results highlight the forget gate and block design as the core contributors to Mamba's success, while the other four designs are less crucial. Based on these findings, we propose a Mamba-Inspired Linear Attention (MILA) model by incorporating the merits of these two key designs into linear attention. The resulting model outperforms various vision Mamba models in both image classification and high-resolution dense prediction tasks, while enjoying parallelizable computation and fast inference speed. Code is available at https://github.com/LeapLabTHU/MLLA.
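To make the "Mamba as a variant of linear attention" reading concrete, here is a minimal NumPy sketch of the two recurrences being compared. It is an illustration under stated assumptions, not the authors' implementation: the function names, tensor shapes, the `eps` stabilizer, and the use of non-negative query/key features are all hypothetical choices made for this example. The first function is causal single-head linear attention written recurrently; the second adds a forget gate, an input gate, and a shortcut, and drops the normalization, which mirrors four of the six distinctions listed above.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V, eps=1e-6):
    """Causal single-head linear attention in recurrent form.

    Assumed shapes: Q, K are (N, d) with non-negative entries, V is (N, d_v).
    """
    N, d = Q.shape
    S = np.zeros((d, V.shape[1]))  # running sum of outer(K_j, V_j) for j <= i
    Z = np.zeros(d)                # running sum of K_j, used for normalization
    Y = np.empty_like(V)
    for i in range(N):
        S += np.outer(K[i], V[i])
        Z += K[i]
        Y[i] = (Q[i] @ S) / (Q[i] @ Z + eps)
    return Y

def gated_linear_attention_recurrent(Q, K, X, forget, in_gate, D):
    """SSM-like variant of the recurrence above (a sketch, not Mamba itself).

    Assumed shapes: forget is (N, d, d_v) with entries in (0, 1],
    in_gate is (N, d_v), D is (d_v,). There is no normalization term.
    """
    N, d = Q.shape
    S = np.zeros((d, X.shape[1]))
    Y = np.empty_like(X)
    for i in range(N):
        # Forget gate decays the previous state; input gate scales the new token.
        S = forget[i] * S + np.outer(K[i], in_gate[i] * X[i])
        # Readout with the query, plus an element-wise shortcut from the input.
        Y[i] = Q[i] @ S + D * X[i]
    return Y
```

With the forget gate and input gate set to all ones and `D` set to zero, the second function reduces to unnormalized linear attention, so dividing its output by the running normalizer recovers the first function. This is the sense in which the gated recurrence can be read as linear attention with extra components.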

Paper Structure (19 sections, 13 equations, 7 figures, 8 tables)

Introduction
Related Works
Preliminaries
Attention Mechanism
Selective State Space Model
Connecting Mamba and Linear Attention Transformer
Interpreting Selective State Space Model as Linear Attention
Analysis of Differences in Core Operations
Analysis of Macro Architecture Design
Relationship between Mamba and Linear Attention Transformer
Empirical Study
Implementation
Empirical Analysis of the Differences
Comparison with Mamba in Vision
Conclusion
...and 4 more sections

Figures (7)

Figure 1: Illustration of selective SSM in Mamba (Eq. \ref{eq:compare_ssm_2}) and single-head linear attention (Eq. \ref{eq:compare_linear_2}). It can be seen that selective SSM resembles single-head linear attention with additional input gate $\mathbf{\Delta}_i$, forget gate $\widetilde{{\bm{A}}}_i$ and shortcut ${\bm{D}}\odot {\bm{x}}_i$, while omitting normalization ${\bm{Q}}_i{\bm{Z}}_i$.
Figure 2: Illustration of selective state space model (Eq. \ref{eq:selective_ssm}) and its equivalent form (Eq. \ref{eq:modified_selective_ssm}).
Figure 3: Illustration of the macro designs of linear attention Transformer, Mamba and our MILA.
Figure 4: (a) Visualizations of the distributions of input gate values. (b) The average of forget gate values in different layers. (c) The attenuation effect of different forget gate values.
Figure 5: The standard deviation of token lengths.
...and 2 more figures
