MambaOut: Do We Really Need Mamba for Vision?

Weihao Yu, Xinchao Wang

TL;DR

Problem: whether Mamba's SSM-based token mixer is necessary for vision tasks. Approach: analyze memory mechanisms and token-mixing modes, classify visual tasks by their long-sequence and autoregressive properties, and test the resulting hypotheses by constructing MambaOut (stacked Gated CNN blocks, i.e., Mamba blocks without the SSM), evaluated on ImageNet, COCO, and ADE20K. Findings: on ImageNet classification, MambaOut surpasses all visual Mamba models, which supports the claim that SSM is unnecessary for that task; on detection and segmentation, which have the long-sequence characteristic, MambaOut cannot match state-of-the-art visual Mamba models, which leaves room for SSM to help. Both results are consistent with the proposed hypotheses. Implications: MambaOut provides a simple, strong baseline for vision tasks, clarifies when Mamba is advantageous, and invites future exploration of Mamba's role in long-sequence vision tasks and LLM/LMM architectures.

Abstract

Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and subsequently applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming when compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba, and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification does not align with either characteristic, we hypothesize that Mamba is not necessary for this task; detection and segmentation tasks are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of Mamba for long-sequence visual tasks. The code is available at https://github.com/yuweihao/MambaOut
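
The contrast the abstract draws between attention's quadratic complexity and an RNN-like token mixer can be illustrated with a toy sketch. It is not the paper's code: tokens are plain scalars, "attention" is a uniform average over the cache, the RNN-like update is a simple exponential decay, and the names `causal_attention_steps`, `rnn_like_steps`, and `decay` are hypothetical, chosen only for illustration. The sketch counts how many memory entries each step has to read.

```python
# Toy illustration only (assumed update rules, not the paper's implementation).

def causal_attention_steps(tokens):
    """Lossless memory: keep every past token and read the whole cache each step."""
    cache = []            # grows by one entry per step
    outputs = []
    reads_per_step = []
    for x in tokens:
        cache.append(x)
        # Uniform "attention" over everything seen so far (assumption: equal weights).
        outputs.append(sum(cache) / len(cache))
        reads_per_step.append(len(cache))
    return outputs, reads_per_step


def rnn_like_steps(tokens, decay=0.5):
    """Lossy memory: compress the past into one fixed-size hidden state."""
    h = 0.0               # fixed-size state, independent of sequence length
    outputs = []
    reads_per_step = []
    for x in tokens:
        h = decay * h + (1.0 - decay) * x
        outputs.append(h)
        reads_per_step.append(1)   # only the single state is read
    return outputs, reads_per_step


tokens = [1.0, 2.0, 3.0, 4.0]
_, attn_reads = causal_attention_steps(tokens)
_, rnn_reads = rnn_like_steps(tokens)
print(attn_reads, sum(attn_reads))  # [1, 2, 3, 4] 10
print(rnn_reads, sum(rnn_reads))    # [1, 1, 1, 1] 4
```

For $T$ tokens the attention-style loop reads $T(T+1)/2$ entries in total, which grows quadratically, while the RNN-like loop reads $T$. The price is that the fixed-size state cannot recover individual past tokens.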

Paper Structure (14 sections, 8 equations, 4 figures, 5 tables, 1 algorithm)

Figures (4)

  • Figure 1: (a) Architecture of Gated CNN dauphin2017language and Mamba gu2023mamba blocks (omitting Normalization and shortcut). The Mamba block extends the Gated CNN with an additional state space model (SSM). As will be conceptually discussed in Section \ref{sec:conceptual_discussion}, SSM is not necessary for image classification on ImageNet deng2009imagenet, russakovsky2015imagenet. To empirically verify this claim, we stack Gated CNN blocks to build a series of models named MambaOut. (b) MambaOut outperforms visual Mamba models, e.g., Vision Mamba zhu2024vision, VMamba liu2024vmamba and PlainMamba yang2024plainmamba, on ImageNet image classification.
  • Figure 2: Illustration of the mechanisms of causal attention and RNN-like models from a memory perspective, where $x_i$ denotes the input token at the $i$-th step. (a) Causal attention stores all previous tokens' keys $k$ and values $v$ as memory. The memory is updated by continuously adding the current token's key and value, so the memory is lossless, but the downside is that the computational complexity of integrating old memory and current tokens increases as the sequence lengthens. Therefore, attention can effectively manage short sequences but may encounter difficulties with longer ones. (b) In contrast, RNN-like models compress previous tokens into a fixed-size hidden state $h$, which serves as the memory. Because of this fixed size, RNN memory is inherently lossy and cannot directly compete with the lossless memory capacity of attention models. Nonetheless, RNN-like models can demonstrate distinct advantages in processing long sequences, as the complexity of merging old memory with current input remains constant, regardless of sequence length.
  • Figure 3: (a) Two modes of token mixing raffel2020exploring. For a total of $T$ tokens, the fully-visible mode allows token $t$ to aggregate inputs from all tokens, i.e., $\{x_i\}_{i=1}^{T}$, to compute its output $y_t$. In contrast, the causal mode restricts token $t$ to only aggregate inputs from preceding and current tokens $\{x_i\}_{i=1}^{t}$. By default, attention operates in fully-visible mode but can be adjusted to causal mode with causal attention masks. RNN-like models, such as Mamba's SSM gu2023mamba, gu2021efficiently, inherently operate in causal mode due to their recurrent nature. (b) We modify the ViT's attention dosovitskiy2020image, touvron2021training from fully-visible to causal mode and observe a performance drop on ImageNet, which indicates causal mixing is unnecessary for understanding tasks.
  • Figure 4: (a) The overall framework of MambaOut for visual recognition. Similar to ResNet he2016deep, MambaOut adopts a hierarchical architecture with four stages. $D_i$ represents the channel dimension at the $i$-th stage. (b) The architecture of the Gated CNN block. The difference between the Gated CNN block dauphin2017language and the Mamba block gu2023mamba lies in the absence of the SSM (state space model) in the Gated CNN block.
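
The two token-mixing modes described in the Figure 3 caption can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's implementation: mixing is reduced to a plain average over the visible tokens, tokens are scalars, and the helper names `visibility_mask` and `mix` are hypothetical.

```python
# Toy illustration of fully-visible vs. causal token mixing (assumed helpers).

def visibility_mask(T, mode):
    """mask[t][i] is True when token t may aggregate input from token i."""
    if mode == "fully_visible":
        return [[True for _ in range(T)] for _ in range(T)]
    if mode == "causal":
        return [[i <= t for i in range(T)] for t in range(T)]
    raise ValueError(f"unknown mode: {mode}")


def mix(tokens, mode):
    """Average the visible inputs for each token (a stand-in for attention or SSM)."""
    mask = visibility_mask(len(tokens), mode)
    outputs = []
    for row in mask:
        visible = [x for x, keep in zip(tokens, row) if keep]
        outputs.append(sum(visible) / len(visible))
    return outputs


tokens = [1.0, 2.0, 3.0, 4.0]
print(mix(tokens, "fully_visible"))  # [2.5, 2.5, 2.5, 2.5]
print(mix(tokens, "causal"))         # [1.0, 1.5, 2.0, 2.5]
```

In causal mode the first token sees only itself, so early outputs are computed from less of the input than later ones. In fully-visible mode every output draws on the whole sequence, which is the setting the caption associates with understanding tasks such as image classification.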