DeciMamba: Exploring the Length Extrapolation Potential of Mamba

Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, Raja Giryes

TL;DR

This paper tackles the challenge of long-context processing by examining Mamba's limited length extrapolation arising from an empirically constrained receptive field. It introduces DeciMamba, a $\Delta_t$-guided decimation mechanism that expands the effective context without retraining, implemented as a context-extension layer atop Mamba's S6. Across long-context benchmarks (LongBench, passkey retrieval) and PG-19 perplexity, DeciMamba achieves substantial extrapolation gains and faster inference, with ablations supporting the design choices. The work provides a path toward efficient, long-range processing in sub-quadratic architectures and discusses potential extensions to other models and directions for future improvements.

Abstract

Long-range sequence processing poses a significant challenge for Transformers due to their quadratic complexity in input length. A promising alternative is Mamba, which demonstrates high performance and achieves Transformer-level capabilities while requiring substantially fewer computational resources. In this paper we explore the length-generalization capabilities of Mamba, which we find to be relatively limited. Through a series of visualizations and analyses we identify that the limitations arise from a restricted effective receptive field, dictated by the sequence length used during training. To address this constraint, we introduce DeciMamba, a context-extension method specifically designed for Mamba. This mechanism, built on top of a hidden filtering mechanism embedded within the S6 layer, enables the trained model to extrapolate well even without additional training. Empirical experiments over real-world long-range NLP tasks show that DeciMamba can extrapolate to context lengths that are significantly longer than the ones seen during training, while enjoying faster inference.
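
The decimation idea can be illustrated with a minimal sketch. It assumes that each token receives a scalar importance score equal to the mean of its $\Delta_t$ values over channels, and that a fixed number of the highest-scoring tokens is kept in their original order. The function name `decimate_by_delta`, the `keep` parameter, and the channel-mean scoring are illustrative assumptions and are not taken from the text above.

```python
import numpy as np

def decimate_by_delta(hidden, delta, keep):
    """Keep the `keep` tokens with the largest mean Delta_t, in original order.

    hidden: (L, D) token representations entering a layer.
    delta:  (L, D) per-token, per-channel Delta_t values (assumed positive).
    keep:   number of tokens to retain (hypothetical hyperparameter).
    """
    hidden = np.asarray(hidden)
    delta = np.asarray(delta, dtype=float)
    if hidden.shape[0] != delta.shape[0]:
        raise ValueError("hidden and delta must have the same sequence length")
    if keep < 1:
        raise ValueError("keep must be at least 1")
    keep = min(keep, hidden.shape[0])

    # Assumption: one importance score per token, averaged over channels.
    scores = delta.mean(axis=1)
    # Indices of the `keep` largest scores (argsort is ascending).
    top = np.argsort(scores, kind="stable")[-keep:]
    # Restore temporal order so the shortened sequence stays causal.
    kept = np.sort(top)
    return hidden[kept], kept

hidden = np.arange(10).reshape(5, 2)
delta = np.array([[0.1, 0.1], [0.9, 0.9], [0.2, 0.2], [0.8, 0.8], [0.05, 0.05]])
shorter, kept = decimate_by_delta(hidden, delta, keep=2)
```

In this toy input, tokens 1 and 3 carry the largest $\Delta_t$ values, so only they are passed on and the sequence shrinks from five tokens to two.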

Paper Structure (28 sections, 11 equations, 12 figures, 9 tables, 1 algorithm)

Figures (12)

  • Figure 1: Improving Mamba Extrapolation with DeciMamba. We present a novel decimation mechanism tailored for Mamba. With our method, Mamba can process sequences much longer than those it was trained on while enjoying reduced inference costs. (Left) Zero-shot evaluation of an instruction-tuned 2.8B Mamba model on a subset of LongBench tasks. We show both LongBench and LongBench_e (three length groups: 0-4K, 4-8K, 8K+). MFQA, GovR, MN, TrQA, LCC, RepB, and Avg stand for MultiFieldQA, GovReport, MultiNews, TriviaQA, LCC, RepoBench-p, and Average. (Right-Top) Passkey Retrieval for Mamba-130m. (Right-Bottom) Same for DeciMamba-130m. All models (130m, 2.8B) were trained on lengths of 2K tokens.
  • Figure 2: DeciMamba. (Left) Schematic overview; (Middle) By carefully inspecting the recurrent view of Mamba, we reveal the implicit filtering mechanism embedded in the recurrent gate and controlled by $\Delta_t$; (Right) Visualization of a DeciMamba model. The grey lines represent the sequence length at the input and output of each layer. Layers with a large ERF are decimated.
  • Figure 3: Detecting and Quantifying Limited ERFs. (Center, Left) Recordings of Mamba Attention Matrices with and without extrapolation (Mamba-130m, layer 17, trained on seq. lengths of 2K). Mamba unintentionally learns a limited ERF during training (highlighted by the dashed rectangle), which disrupts its extrapolation abilities. (Right) Quantifying Mamba's Information Loss by Measuring $\sum_{k=2}^{L}\Delta_k$ Divergence. To show Mamba's sensitivity to increasing context lengths, we measure the first occasion of information loss, as described in Sec. \ref{sec:IdentifyTheProblem}. We observe that in the most semantic layers (16 and 17, see Passkey Retrieval in Sec. \ref{sec:passkey}) $\sum_{k=2}^{L}\Delta_k$ diverges exponentially fast, causing a fast decay in the attention values and leading to limited ERFs like the one in the center image.
  • Figure 4: Mamba Mean Distance. Each panel contains an attention matrix along with its corresponding 'Mamba Mean Distance', depicted by the horizontal distance between the red diagonal line and the main diagonal. The matrices are extracted from a pre-trained model of size 2.8B, trained on the Pile dataset (Gao et al., 2020).
  • Figure 5: Mamba Mean Distance Quantifies Effective Context Utilization. (Left) Mamba Mean Distance as a function of the context length during inference, for various training context lengths. (Right) Same, but normalized by the training context length.
  • ...and 7 more figures
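
The link between a growing $\sum_{k=2}^{L}\Delta_k$ and decaying attention values (Figure 3) can be sketched numerically. The sketch assumes a single state channel with a negative scalar transition parameter, here called `a`, so that the first token's contribution to the state after $L$ steps is scaled by $\exp(a \sum_{k=2}^{L}\Delta_k)$. The function name `first_token_decay`, the scalar `a`, and the sample $\Delta$ values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def first_token_decay(delta, a=-1.0):
    """Scale factor on the first token's contribution for each length L >= 2.

    delta: 1-D sequence of Delta_k values for k = 1..L_max (assumed positive).
    a:     negative scalar transition parameter (single-channel assumption).
    Returns an array whose entry L-2 equals exp(a * sum_{k=2}^{L} delta_k).
    """
    delta = np.asarray(delta, dtype=float)
    if a >= 0:
        raise ValueError("a must be negative for a decaying state")
    # Running sums of Delta_k starting at k = 2 (index 1 in 0-based terms).
    cumulative = np.cumsum(delta[1:])
    return np.exp(a * cumulative)

# Hypothetical Delta values: the larger the running sum, the smaller the factor.
decay = first_token_decay([0.5, 0.1, 0.1, 0.1])
```

Under these assumptions the factor shrinks monotonically with $L$, so a sum that grows quickly with context length drives the contribution of early tokens toward zero, which is one way to read the limited ERF shown in the center panel.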