LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement

Zhifan Ye, Kejing Xia, Yonggan Fu, Xin Dong, Jihoon Hong, Xiangchi Yuan, Shizhe Diao, Jan Kautz, Pavlo Molchanov, Yingyan Celine Lin

TL;DR

LongMamba introduces a training-free method to extend Mamba-style state-space models’ long-context understanding by dividing hidden-state channels into local and global families and enlarging the receptive fields of the global channels via token filtering. The approach targets the exponential hidden-state decay that erodes global-channel memory when the input length exceeds the training sequence length, using a two-step process: (1) classify channels by cumulative decay with a threshold $\theta$, and (2) filter tokens for global channels according to a per-length threshold $g(S)$ so that long-context behavior stays aligned with training conditions. Experiments on PG-19, RULER, and LongBench-E show improvements over vanilla Mamba and DeciMamba in both perplexity and accuracy, with only modest latency overhead. This work advances efficient long-context modeling by offering a practical, training-free mechanism to extend the operational range of Mamba and related SSMs, narrowing the gap with Transformer-based models on long-context tasks.

Abstract

State space models (SSMs) have emerged as an efficient alternative to Transformer models for language modeling, offering linear computational complexity and constant memory usage as context length increases. However, despite their efficiency in handling long contexts, recent studies have shown that SSMs, such as Mamba models, generally underperform Transformers in long-context understanding tasks. To address this significant shortfall and achieve both efficient and accurate long-context understanding, we propose LongMamba, a training-free technique that significantly enhances the long-context capabilities of Mamba models. LongMamba builds on our discovery that the hidden channels in Mamba can be categorized into local and global channels based on their receptive field lengths, with global channels primarily responsible for long-context capability. These global channels can become the key bottleneck as the input context lengthens. Specifically, when input lengths greatly exceed the training sequence length, global channels are limited in their ability to adaptively extend their receptive fields, leading to Mamba's poor long-context performance. The key idea of LongMamba is to mitigate the hidden state memory decay in these global channels by preventing the accumulation of unimportant tokens in their memory. This is achieved by first identifying critical tokens in the global channels and then applying token filtering to accumulate only those critical tokens. Through extensive benchmarking across synthetic and real-world long-context scenarios, LongMamba sets a new standard for Mamba's long-context performance, significantly extending its operational range without requiring additional training. Our code is available at https://github.com/GATECH-EIC/LongMamba.
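
The two steps named in the TL;DR (channel classification with a threshold $\theta$, then token filtering with a per-length threshold $g(S)$) can be illustrated with a toy, single-layer sketch. This is not the authors' implementation: it assumes each channel has a scalar decay rate `a < 0` and per-token step sizes `deltas`, that the per-token decay factor is `exp(delta_t * a)`, that a channel counts as global when its cumulative decay stays above `theta`, and that filtering zeroes the step size of tokens below `g`. The function names, the toy data, and the fixed value passed as `g` are all hypothetical.

```python
import math

def cumulative_decay(deltas, a):
    """Product over tokens of exp(delta_t * a) for one channel, with a < 0."""
    return math.exp(a * sum(deltas))

def classify_channels(deltas_per_channel, a_per_channel, theta):
    """Label a channel 'global' if its cumulative decay stays above theta (assumed rule)."""
    labels = []
    for deltas, a in zip(deltas_per_channel, a_per_channel):
        decay = cumulative_decay(deltas, a)
        labels.append("global" if decay > theta else "local")
    return labels

def filter_tokens(deltas, g):
    """Zero the step size of tokens below g, so they no longer add to the state's decay."""
    return [d if d >= g else 0.0 for d in deltas]

# Toy data (hypothetical): two channels over a 100-token "training-length" sequence.
train_deltas = [[0.01] * 100, [0.5] * 100]
a_values = [-1.0, -1.0]
labels = classify_channels(train_deltas, a_values, theta=0.01)

# A longer 400-token input for the slow-decaying channel.
long_deltas = [0.002, 0.018] * 200
filtered = filter_tokens(long_deltas, g=0.01)
decay_before = cumulative_decay(long_deltas, a_values[0])
decay_after = cumulative_decay(filtered, a_values[0])
print(labels, decay_before, decay_after)
```

In this sketch, filtering leaves the sequence length unchanged but reduces the total step size accumulated by the global channel, so less of the early state is lost by the end of the long input. How $g(S)$ is actually chosen for a given length $S$ is not shown here.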

Paper Structure

This paper contains 20 sections, 9 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Visualization of the Mamba-130M model's attention map (log scale) under (a) training sequence length (2,000 tokens) and (b) extended sequence length (16,000 tokens). We uniformly sample five hidden state channels in the 12th layer of the Mamba model and select a sequence from the Pile dataset (Gao et al., 2020), i.e., Mamba's training dataset, as the input. The red lines delineate the receptive field of each channel, showing the range of tokens that significantly influence the current token's output. We select three channels ((i)-(iii)) with local receptive fields and two channels ((iv)-(v)) with global receptive fields to illustrate the distinct patterns of information processing in Mamba.
  • Figure 2: (a) Visualize the issue of directly applying Mamba models to a sequence (sequence length denoted as $S$) longer than the training sequence length (denoted as $L$); (b) Visualize the proposed LongMamba framework, where we enlarge the receptive fields of the global channels using the two-step pipeline detailed in Sec. \ref{sec:method:identify} and Sec. \ref{sec:method:align}.
  • Figure 3: Perplexity on the PG-19 dataset under varying sequence lengths. We evaluate three models: Mamba-1.4B, Mamba2-1.3B, and Zamba2-1.2B. The Mamba-1.4B and Mamba2-1.3B models are trained on sequences of 2k tokens, while the Zamba2-1.2B model is trained on sequences of 4k tokens. For all three models, we measure perplexity on PG-19 sequences of up to 60k tokens. In the figure, "Vanilla" refers to the baseline models without applying DeciMamba or LongMamba.
  • Figure 4: Visualization of the cumulative hidden state decay (defined in Eq. \ref{eq:decay}) as the number of tokens processed by the model increases. In the figure, we visualize 12 global channels sampled from the 16th layer of the Mamba-130M model. We use a sequence sampled from the Pile dataset (Gao et al., 2020) as input and plot the hidden state decay of each global channel in a unique color.
  • Figure 5: Visualization of the attention maps (log scale) for a sequence composed of 2,000 tokens. In this figure, we sample a sequence from the Pile dataset (Gao et al., 2020) as input and visualize 48 channels randomly sampled from the Mamba-130M model. The channels are sorted by their cumulative decay on the sampled sequence. The red lines delineate the receptive field of each channel, covering all attention scores greater than $10^{-3}$.
  • ...and 1 more figure
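
The receptive-field criterion in the Figure 5 caption (all attention scores greater than $10^{-3}$) can be made concrete with a small sketch. It rests on a simplifying assumption that is not taken from the paper: the score of an earlier token, as seen from the last token, is taken to be only the accumulated decay `exp(a * sum of later step sizes)`, ignoring any input-dependent terms. The function names and toy step sizes are hypothetical.

```python
import math

def decay_scores(deltas, a):
    """Score of each token j as seen from the last token: exp(a * sum(deltas[j+1:]))."""
    scores = []
    suffix = 0.0  # running sum of step sizes of the tokens after position j
    for d in reversed(deltas):
        scores.append(math.exp(a * suffix))
        suffix += d
    scores.reverse()
    return scores

def receptive_field(scores, threshold=1e-3):
    """Number of tokens from the earliest score above threshold up to the last token."""
    for j, s in enumerate(scores):
        if s > threshold:
            return len(scores) - j
    return 0

# Toy channels (hypothetical step sizes) over a 50-token sequence, both with a = -1.
fast_channel = receptive_field(decay_scores([0.5] * 50, -1.0))   # decays quickly
slow_channel = receptive_field(decay_scores([0.01] * 50, -1.0))  # decays slowly
print(fast_channel, slow_channel)
```

Under these assumptions the fast-decaying channel only reaches back over a short window of recent tokens, while the slow-decaying channel covers the whole 50-token sequence, mirroring the local versus global distinction drawn in the captions.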