Exploring the Limitations of Mamba in COPY and CoT Reasoning
Ruifeng Ren, Zhicong Li, Yong Liu
TL;DR
This paper examines the limits of the Mamba architecture on long-sequence COPY and Chain-of-Thought (CoT) reasoning by comparing its expressive capacity with that of Transformers. Building on the connection between Mamba's state-space module and linear attention, it shows that constant-size Mamba struggles with COPY and DP-style CoT, whereas Mamba whose size grows linearly with the sequence length can perform COPY but no longer saves overhead. The authors prove bounds on when COPY is feasible for constant- and variable-size SSMs and analyze CoT tasks formulated as Dynamic Programming (DP) problems: on arbitrary DP problems Mamba's total cost is comparable to that of Transformer baselines, while locality in DP problems can yield efficiency benefits. Experiments on copy and CoT tasks corroborate the theory, indicating that Mamba is not universally superior to Transformers but can offer advantages in locality-exploiting settings. These findings inform design choices and motivate hybrid architectures that combine the strengths of Mamba and Transformers for long-context reasoning.
Abstract
Transformers have become the backbone of modern Large Language Models (LLMs); however, their inference overhead grows linearly with the sequence length, posing challenges for modeling long sequences. In light of this, Mamba has attracted attention for maintaining a constant inference size, with empirical evidence demonstrating that it can match Transformer performance in sequence modeling while significantly reducing computational costs. However, an open question remains: can Mamba always bring savings while achieving performance comparable to Transformers? In this paper, we analyze the expressive ability of Mamba to perform the COPY operation we define and Chain-of-Thought (CoT) reasoning. First, inspired by the connection between Mamba and linear attention, we show that constant-sized Mamba may struggle to perform COPY operations, while Transformers can handle them more easily. When the size of Mamba grows linearly with the input sequence length, it can perform COPY accurately, but in this case Mamba no longer provides overhead savings. Based on this observation, we further analyze Mamba's ability to tackle CoT tasks, which can be formulated as Dynamic Programming (DP) problems. Our findings suggest that, to solve arbitrary DP problems, the total cost of Mamba is still comparable to that of standard Transformers. However, as with efficient Transformers, when facing DP problems with favorable properties such as locality, Mamba can provide savings in overhead. Our experiments on copy and CoT tasks further demonstrate Mamba's limitations compared to Transformers in learning these tasks.
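The intuition behind the COPY limitation can be illustrated with a toy linear-attention memory. The sketch below is not the paper's construction and omits Mamba's selective gating and decay; it only assumes an outer-product state S = sum_i v_i k_i^T of fixed size and one-hot keys, and the names `linear_attention_state`, `read`, and `copy_with_fixed_state` are hypothetical. When the sequence is no longer than the state dimension, the keys are orthonormal and every token is read back exactly; once the sequence is longer, keys must be reused and the stored values interfere.

```python
def one_hot(index, dim):
    """Return the standard basis vector e_index in R^dim."""
    return [1.0 if j == index else 0.0 for j in range(dim)]


def linear_attention_state(keys, values):
    """Accumulate the fixed-size state S = sum_i v_i k_i^T (d_v x d_k)."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_k for _ in range(d_v)]
    for k, v in zip(keys, values):
        for r in range(d_v):
            for c in range(d_k):
                state[r][c] += v[r] * k[c]
    return state


def read(state, query):
    """Compute S q, the linear-attention readout for one query."""
    return [sum(s * q for s, q in zip(row, query)) for row in state]


def copy_with_fixed_state(tokens, vocab_size, state_dim):
    """Try to copy `tokens` through a state with `state_dim` key slots.

    Assumption for illustration: position i uses the one-hot key
    e_(i mod state_dim), so keys collide once len(tokens) > state_dim.
    """
    keys = [one_hot(i % state_dim, state_dim) for i in range(len(tokens))]
    values = [one_hot(t, vocab_size) for t in tokens]
    state = linear_attention_state(keys, values)
    copied = []
    for key in keys:
        scores = read(state, key)
        # Decode by taking the highest-scoring token (first one on ties).
        copied.append(max(range(vocab_size), key=lambda t: scores[t]))
    return copied
```

With `tokens = [2, 0, 3, 1]` and `vocab_size = 4`, a state with four key slots reproduces the sequence, while a state with two key slots cannot, because positions 0 and 2 (and 1 and 3) share a key. One-hot keys are only one choice, but no set of more than `state_dim` vectors in `state_dim` dimensions can be mutually orthogonal, so some interference is unavoidable in this simplified setting.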
