MemMamba: Rethinking Memory Patterns in State Space Model

Youjin Wang, Yangjingyi Chen, Jiahao Yan, Jiaxuan Lu, Xiao Sun

TL;DR

MemMamba addresses the memory bottleneck in ultra-long sequence modeling by diagnosing memory decay in Mamba through an information-theoretic lens and introducing a memory-centric architecture that combines state summarization with sparse cross-layer and cross-token attention. The proposed horizontal–vertical memory fidelity framework (ETMF and ECLMF) quantifies information loss and guides architectural design to mitigate forgetting without sacrificing efficiency. Empirically, MemMamba achieves state-of-the-art memory retention across language modeling, sparse retrieval, and cross-document tasks, with strong robustness to context length and a reported 48% inference speedup over Transformer baselines. The work offers a practical paradigm for memory-aware, linear-complexity sequence models, with broad implications for ultra-long context understanding and retrieval-augmented reasoning.

Abstract

With the explosive growth of data, long-sequence modeling has become increasingly important in tasks such as natural language processing and bioinformatics. However, existing methods face inherent trade-offs between efficiency and memory. Recurrent neural networks suffer from vanishing and exploding gradients, making them hard to scale. Transformers can model global dependencies but are constrained by quadratic complexity. Recently, selective state-space models such as Mamba have demonstrated high efficiency with O(n) time and O(1) recurrent inference, yet their long-range memory decays exponentially. In this work, we conduct mathematical derivations and information-theoretic analysis to systematically uncover the memory decay mechanism of Mamba, answering a fundamental question: what is the nature of Mamba's long-range memory and how does it retain information? To quantify key information loss, we further introduce horizontal-vertical memory fidelity metrics that capture degradation both within and across layers. Inspired by how humans distill and retain salient information when reading long documents, we propose MemMamba, a novel architectural framework that integrates a state summarization mechanism with cross-layer and cross-token attention, alleviating long-range forgetting while preserving linear complexity. MemMamba achieves significant improvements over existing Mamba variants and Transformers on long-sequence benchmarks such as PG19 and Passkey Retrieval, while delivering a 48% speedup in inference efficiency. Both theoretical analysis and empirical results demonstrate that MemMamba achieves a breakthrough in the complexity-memory trade-off, offering a new paradigm for ultra-long sequence modeling.
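
The exponential decay mentioned in the abstract can be illustrated with a toy scalar recurrence. The sketch below is not the paper's derivation or Mamba's actual selective (input-dependent) parameterization. It assumes a fixed decay factor `a` with 0 < a < 1 in the recurrence h_t = a * h_{t-1} + x_t, and the function names `contribution_after` and `half_life` are hypothetical, chosen only for this illustration.

```python
import math


def contribution_after(a: float, steps: int) -> float:
    """Weight that a unit input at step 0 still carries in the state
    after `steps` further updates of h_t = a * h_{t-1} + x_t.

    Assumption: `a` is a fixed scalar decay factor and all later
    inputs are zero, so only the decay term acts on the state.
    """
    h = 1.0  # unit input injected at step 0
    for _ in range(steps):
        h = a * h  # no new input: the old contribution shrinks by a
    return h


def half_life(a: float) -> float:
    """Number of steps after which the contribution has halved (0 < a < 1)."""
    return math.log(0.5) / math.log(a)


if __name__ == "__main__":
    # Even a decay factor close to 1 leaves little of a distant token.
    for steps in (10, 100, 1000):
        print(steps, contribution_after(0.99, steps))
    print("half-life for a=0.99:", half_life(0.99))
```

Under these assumptions the surviving weight equals a**steps, so the distance at which a token is effectively forgotten grows only logarithmically with the tolerance one sets on that weight.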

Paper Structure

This paper contains 37 sections, 43 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overall workflow of MemMamba. The framework is composed of $n$ stacked MemMamba Block Layers, where each layer preserves critical context via the Note Block and enables long-range interaction through sparse cross-layer attention.
  • Figure 2: Workflow of a MemMamba Block Layer. Each block integrates three components: state space model (SSM) updates, cross-token attention, and periodically triggered cross-layer attention.
  • Figure 3: Ablation results of the core mechanisms. The same hardware conditions and training configurations are used.
  • Figure 4: Comparison of ETMF and ECLMF across different Mamba variants.
  • Figure 5: Comparison of perplexity (PPL) across models at different context lengths.
  • ...and 2 more figures