Stuffed Mamba: Oversized States Lead to the Inability to Forget

Yingfa Chen, Xinrong Zhang, Shengding Hu, Xu Han, Zhiyuan Liu, Maosong Sun

TL;DR

The paper investigates why Mamba-style RNNs struggle to process contexts longer than their training length. It identifies an inability to forget as the core culprit: when the fixed-size state is overparameterized relative to the training length, the model can fit language well without ever learning to forget. Through controlled experiments on language modeling and a passkey retrieval task, the authors show that the training length needed to learn forgetting grows linearly with state size, while the maximum context length at which the passkey can still be recalled grows exponentially with state size. They propose methods to induce forgetting (RRI) and demonstrate that forgetting dynamics critically influence long-context performance, offering practical guidance for designing future long-context RNNs and for understanding memory interference. The findings highlight a fundamental trade-off among state capacity, training length, and forgetting mechanisms in robust long-context modeling.

Abstract

Recent advancements in recurrent architectures, such as Mamba and RWKV, have showcased strong language capabilities. Unlike transformer-based models, these architectures encode all contextual information into a fixed-size state, leading to great inference efficiency. However, this approach can cause information interference, where different token data conflicts, resulting in performance degradation and incoherent outputs beyond a certain context length. To prevent this, most RNNs incorporate mechanisms designed to "forget" earlier tokens. In this paper, we reveal that Mamba-based models struggle to effectively forget earlier tokens even with built-in forgetting mechanisms. We demonstrate that this issue stems from training on contexts that are too short for the state size, enabling the model to perform well without needing to learn how to forget. Then, we show that the minimum training length required for the model to learn forgetting scales linearly with the state size, and the maximum context length for accurate retrieval of a 5-digit passkey scales exponentially with the state size, indicating that the model retains some information beyond the point where forgetting begins. These findings highlight a critical limitation in current RNN architectures and provide valuable insights for improving long-context modeling. Our work suggests that future RNN designs must account for the interplay between state size, training length, and forgetting mechanisms to achieve robust performance in long-context tasks.
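To make the idea of a fixed-size state with a forgetting mechanism concrete, here is a minimal sketch of a generic gated linear recurrence. It is not Mamba-2's actual update rule: the function name `run_decay_recurrence`, the scalar per-step decay, and the outer-product write are all illustrative assumptions. The sketch tracks how much weight the first token still carries in the state after each step, which is simply the running product of the decay values applied since it was written.

```python
import numpy as np

def run_decay_recurrence(keys, values, decays):
    """Toy recurrence (an assumption, not the paper's model):
        state_t = decay_t * state_{t-1} + outer(value_t, key_t)
    Returns the final state and the weight the first token retains
    in the state after each step."""
    d_v, d_k = values.shape[1], keys.shape[1]
    state = np.zeros((d_v, d_k))   # fixed size, independent of sequence length
    first_weight = 1.0             # the first token is written with weight 1
    retention = []
    for t, (k, v, a) in enumerate(zip(keys, values, decays)):
        state = a * state + np.outer(v, k)
        if t > 0:
            # every later step scales the first token's contribution by its decay
            first_weight *= a
        retention.append(first_weight)
    return state, np.array(retention)

# Hypothetical usage: decays close to 1 mean the model barely forgets.
rng = np.random.default_rng(0)
T = 16
_, weak_forgetting = run_decay_recurrence(
    rng.normal(size=(T, 4)), rng.normal(size=(T, 8)), np.full(T, 0.999))
_, strong_forgetting = run_decay_recurrence(
    rng.normal(size=(T, 4)), rng.normal(size=(T, 8)), np.full(T, 0.5))
print(weak_forgetting[-1], strong_forgetting[-1])
```

In this toy setting, decays that stay near 1 leave early tokens in the state almost untouched, so the state keeps accumulating content as the sequence grows. That is the kind of behavior the abstract describes as failing to forget.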

Paper Structure

This paper contains 46 sections, 9 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: The LM loss of Mamba-2 as a function of token position. The training length is 8K.
  • Figure 2: The accuracy of Mamba-2 on the passkey retrieval task. "Ans. Depth" refers to the passkey position divided by the context length.
  • Figure 3: The retention strength of the first token ($\alpha_{1:t}$) over time. Each curve represents a head.
  • Figure 4: LM loss of Mamba-2 370M at different positions when inducing forgetting (see the section on inducing forgetting).
  • Figure 5: The mean and variance of the first 8 heads in layer 38 of Mamba-2 370M. It exhibits a clear explosion when $t$ is greater than the training length.
  • ...and 13 more figures
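
The "Ans. Depth" quantity in the Figure 2 caption (passkey position divided by context length) can be illustrated with a small sketch that builds a passkey retrieval context. Everything below is an assumption made for illustration: the function name `build_passkey_context`, the word-level "tokens", the filler sentence, and the `passkey:` marker are hypothetical and are not taken from the paper's actual prompt format. Only the 5-digit passkey and the depth definition come from the text above.

```python
import random

def build_passkey_context(context_len, answer_depth,
                          filler="the grass is green .", seed=0):
    """Build a toy context of `context_len` word-level tokens with a
    5-digit passkey placed at relative depth `answer_depth` in [0, 1].
    The layout is a hypothetical stand-in, not the paper's prompt."""
    rng = random.Random(seed)
    passkey = f"{rng.randrange(10**5):05d}"      # 5 digits, zero-padded
    filler_words = filler.split()
    # depth = position / context_len, so position = depth * context_len
    position = min(int(answer_depth * context_len), context_len - 1)
    words = [filler_words[i % len(filler_words)] for i in range(context_len)]
    words[position] = f"passkey:{passkey}"
    return words, passkey, position

# Hypothetical usage: a 200-token context with the passkey a quarter of the way in.
words, passkey, position = build_passkey_context(200, 0.25)
print(position / len(words), passkey)
```

Sweeping `context_len` and `answer_depth` over a grid and checking whether the model reproduces `passkey` would give an accuracy map of the kind the Figure 2 caption describes.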