Exploring the Limitations of Mamba in COPY and CoT Reasoning

Ruifeng Ren, Zhicong Li, Yong Liu

TL;DR

The paper investigates the limitations of the Mamba architecture for long-sequence COPY and Chain-of-Thought (CoT) reasoning by analyzing its expressive capacity relative to Transformers. It connects Mamba's state-space module to linear attention and shows that constant-size Mamba struggles with COPY and DP-style CoT. Linear-size Mamba can perform COPY but then loses its overhead savings, and solving arbitrary DP problems costs roughly as much as it does for standard Transformers. The authors prove results on COPY feasibility for constant-size and linearly scaling SSM modules and analyze CoT for DP, showing that locality in DP problems can yield efficiency benefits for Mamba. Experiments on copy and CoT tasks corroborate the theory: Mamba is not universally superior to Transformers but can offer advantages in settings that exploit locality. These findings inform design choices and motivate hybrid architectures that combine the strengths of Mamba and Transformers for long-context reasoning tasks.

Abstract

Transformers have become the backbone of modern Large Language Models (LLMs); however, their inference overhead grows linearly with the sequence length, posing challenges for modeling long sequences. In light of this, Mamba has attracted attention for maintaining a constant inference size, with empirical evidence demonstrating that it can match Transformer performance in sequence modeling while significantly reducing computational costs. However, an open question remains: can Mamba always bring savings while achieving performance comparable to Transformers? In this paper, we focus on analyzing the expressive ability of Mamba to perform our defined COPY operation and Chain of Thought (CoT) reasoning. First, inspired by the connection between Mamba and linear attention, we show that constant-sized Mamba may struggle to perform COPY operations while Transformers can handle them more easily. However, when the size of Mamba grows linearly with the input sequence length, it can accurately perform COPY, but in this case, Mamba no longer provides overhead savings. Based on this observation, we further analyze Mamba's ability to tackle CoT tasks, which can be described by the Dynamic Programming (DP) problems. Our findings suggest that to solve arbitrary DP problems, the total cost of Mamba is still comparable to standard Transformers. However, similar to efficient Transformers, when facing DP problems with favorable properties such as locality, Mamba can provide savings in overhead. Our experiments on the copy and CoT tasks further demonstrate Mamba's limitations compared to Transformers in learning these tasks.
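The capacity argument in the abstract can be illustrated with a toy fixed-size memory. The sketch below is not the paper's construction: it assumes a decay-free, linear-attention-style state S = sum_j k_j v_j^T read out with the key of the target position, and all names (`outer_add`, `read`, `copy_error`, `one_hot`) are hypothetical. With as many tokens as key dimensions and orthonormal keys, the stored value is recovered exactly; once the sequence is longer than the key dimension, keys can no longer be mutually orthogonal and the readout mixes in other values.

```python
import random

def outer_add(state, k, v):
    """Accumulate the rank-1 update k v^T into a d_k x d_v state, in place."""
    for r, k_r in enumerate(k):
        row = state[r]
        for c, v_c in enumerate(v):
            row[c] += k_r * v_c

def read(state, q):
    """Read out q^T S, a d_v-dimensional vector."""
    d_v = len(state[0])
    return [sum(q[r] * state[r][c] for r in range(len(q))) for c in range(d_v)]

def copy_error(keys, values, target):
    """Write every (key, value) pair into one fixed-size state, query with
    keys[target], and return the max-abs error against values[target]."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    for k, v in zip(keys, values):
        outer_add(state, k, v)
    out = read(state, keys[target])
    return max(abs(o - t) for o, t in zip(out, values[target]))

def one_hot(i, d):
    return [1.0 if j == i else 0.0 for j in range(d)]

rng = random.Random(0)
d_k, d_v = 4, 3  # illustrative sizes, chosen arbitrarily

# Case 1: N = d_k tokens with orthonormal keys, so retrieval is exact.
values = [[rng.uniform(-1, 1) for _ in range(d_v)] for _ in range(d_k)]
keys = [one_hot(i, d_k) for i in range(d_k)]
err_fit = copy_error(keys, values, target=2)

# Case 2: N = 2 * d_k tokens must share key directions, so values interfere.
values2 = [[rng.uniform(-1, 1) for _ in range(d_v)] for _ in range(2 * d_k)]
keys2 = [one_hot(i % d_k, d_k) for i in range(2 * d_k)]
err_overflow = copy_error(keys2, values2, target=2)

print(err_fit, err_overflow)
```

Growing `d_k` with the sequence length restores exact retrieval in this toy, but the state then scales with the input, which mirrors the trade-off the abstract describes.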

Paper Structure

This paper contains 19 sections, 15 theorems, 40 equations, 9 figures.

Key Result

Theorem 1

Given an SSM module with constant size and an input sequence ${\bm{x}}_1, {\bm{x}}_2, \dots, {\bm{x}}_{N} \in [-M, M]^{d}$ such that Assumption assum:copy holds, then for any $\epsilon > 0$, the SSM module can approximate the COPY operation at some position $i$, that is, $\| {\bm{y}}_i - {\bm{o}}_i \|$ ...

Figures (9)

  • Figure 1: Illustration of the simplified Mamba layer we focus on. Left: A Mamba layer consists of a Mamba block with a residual connection. The Mamba block uses a gated MLP to control the output of the SSM module; we call the branch containing the SSM module "the SSM branch" and the other "the gated branch". Right: The SSM module used in Mamba can be rewritten in a form similar to linear attention, where $\mathbf{\Delta}_i$, ${\bm{b}}_i$, and ${\bm{c}}_i$ in the SSM are all derived from the current ${\bm{x}}_i$, similar to ${\bm{v}}_i$, ${\bm{k}}_i$, and ${\bm{q}}_i$ in linear attention, respectively.
  • Figure 2: An example of the COPY operation and the ($L,\delta$)-matching set. We expect the output at position $i$ to be the historical record (value) corresponding to "dog". Historical records in the ($L,\delta$)-matching set, labeled in blue, are those more relevant to the output ${\bm{o}}_i$, as measured by attention scores $|{\bm{c}}_i^T{\bm{b}}_j| \ge \delta$.
  • Figure 3: Left: Accuracy during training of models with different sizes on the copy task. Right: The performance of Mamba when the length of the input sequence to be copied is changed.
  • Figure 4: Accuracy during training when the task length $L = 30/70$ and $d = 256$ (TF denotes Transformer).
  • Figure 5: Accuracy of Mamba and Transformer under different task lengths and model sizes.
  • ...and 4 more figures
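
The linear-attention view in the Figure 1 caption can be checked numerically on a small example. The following is a minimal sketch under simplifying assumptions that are ours, not the paper's: a single scalar input channel, a scalar decay a_i = exp(delta_i * decay_rate), and the update h_i = a_i * h_{i-1} + (delta_i * x_i) * b_i with output y_i = c_i . h_i. The function names and the `decay_rate` parameter are hypothetical. The helper `matching_indices` applies only the score threshold |c_i . b_j| >= delta from the Figure 2 caption; it does not model the L part of the ($L,\delta$)-matching set.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ssm_recurrent(xs, deltas, bs, cs, decay_rate=-0.5):
    """Recurrent form: h_i = a_i * h_{i-1} + (delta_i * x_i) * b_i, y_i = c_i . h_i,
    with a_i = exp(delta_i * decay_rate). The state h keeps a fixed size."""
    h = [0.0] * len(bs[0])
    ys = []
    for x, delta, b, c in zip(xs, deltas, bs, cs):
        a = math.exp(delta * decay_rate)
        h = [a * h_n + delta * x * b_n for h_n, b_n in zip(h, b)]
        ys.append(dot(c, h))
    return ys

def ssm_attention_form(xs, deltas, bs, cs, decay_rate=-0.5):
    """Unrolled form: y_i = sum_{j<=i} (c_i . b_j) * decay(j, i) * (delta_j * x_j),
    where decay(j, i) is the product of a_t for t = j+1, ..., i."""
    ys = []
    for i in range(len(xs)):
        y = 0.0
        for j in range(i + 1):
            decay = math.exp(decay_rate * sum(deltas[j + 1 : i + 1]))
            y += dot(cs[i], bs[j]) * decay * deltas[j] * xs[j]
        ys.append(y)
    return ys

def matching_indices(cs, bs, i, delta_threshold):
    """Positions j <= i whose score |c_i . b_j| reaches the threshold."""
    return [j for j in range(i + 1) if abs(dot(cs[i], bs[j])) >= delta_threshold]

# Toy inputs, chosen by hand for illustration only.
xs = [1.0, -2.0, 0.5, 3.0]
deltas = [0.1, 0.4, 0.2, 0.3]
bs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
cs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.0, 1.0]]

y_rec = ssm_recurrent(xs, deltas, bs, cs)
y_att = ssm_attention_form(xs, deltas, bs, cs)
matches = matching_indices(cs, bs, i=3, delta_threshold=0.7)
print(y_rec, y_att, matches)
```

In the unrolled form, c_i acts like a query, b_j like a key, and delta_j * x_j like a value, which is the correspondence the caption draws. The recurrent form stores only h, while the unrolled form revisits every earlier position.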

Theorems & Definitions (31)

  • Definition 1: COPY Operation
  • Definition 2: ($L,\delta$)-Matching Set
  • Theorem 1: Approximate COPY operation with constant-size SSM module
  • Theorem 2: Approximate COPY operation with constant-size attention module
  • Theorem 3: Perform COPY operation with linear-scaling size
  • Theorem 4: Solve DP problems with CoT
  • Theorem 5: Solve $m$-locality DP problems with CoT
  • Theorem 6: Approximate COPY operation with constant-size SSM module
  • proof
  • Definition 3: COPY Operation for the attention module
  • ...and 21 more
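
Why locality helps (Theorem 5) can be sketched in plain Python, without any model. The example assumes that $m$-locality means each DP state depends only on the $m$ most recently produced states; this is our reading, since the listing above gives only the theorem title. The solver names and the example recurrence are hypothetical. Under that assumption, a solver needs to remember only a window of $m$ states rather than the whole chain.

```python
from collections import deque

def solve_full_history(inputs, transition, m):
    """Keep every previous state; the stored history grows with the chain length."""
    states = []
    for x in inputs:
        states.append(transition(x, states[-m:]))
    return states

def solve_with_window(inputs, transition, m):
    """Keep only the last m states as working memory, independent of chain length.
    The answers list is the emitted chain, not memory the transition reads."""
    window = deque(maxlen=m)
    answers = []
    for x in inputs:
        s = transition(x, list(window))
        window.append(s)
        answers.append(s)
    return answers

def max_sum_no_adjacent(x, recent):
    """dp[i] = max(dp[i-1], dp[i-2] + x_i): a DP that looks back two states."""
    prev1 = recent[-1] if len(recent) >= 1 else 0
    prev2 = recent[-2] if len(recent) >= 2 else 0
    return max(prev1, prev2 + x)

inputs = [3, 2, 7, 10, 1]
full = solve_full_history(inputs, max_sum_no_adjacent, m=2)
windowed = solve_with_window(inputs, max_sum_no_adjacent, m=2)
print(full, windowed)  # both give [3, 3, 10, 13, 13]
```

A DP whose states may depend on arbitrary earlier states offers no such bounded window, which is consistent with the abstract's claim that arbitrary DP problems cost Mamba about as much as standard Transformers.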