Characterizing Mamba's Selective Memory using Auto-Encoders

Tamanna Hossain, Robert L. Logan, Ganesh Jagadeesan, Sameer Singh, Joel Tetreault, Alejandro Jaimes

TL;DR

The paper develops an auto-encoder probe to quantify selective memory in Mamba, a state space LM: it reconstructs inputs from frozen hidden states and compares them with the originals using token-level omission rates and sequence-level ROUGE F1. Across 130M–1.4B models and sequences of 4–256 tokens, it finds that math-related content (numbers, variables), mentions of organization entities, and dialects other than Standard American English are disproportionately forgotten, and that the tokens forgotten most often tend to be those that are less frequent in the pretraining data. The study offers a general framework for diagnosing memory limitations in LMs with fixed-size hidden states and points to practical directions, such as tokenization and pretraining adjustments, for improving retention of critical information. These findings matter for tasks that require precise reasoning and information recall, and the authors release their implementation for future work.

Abstract

State space models (SSMs) are a promising alternative to transformers for language modeling because they use fixed memory during inference. However, this fixed memory usage means that some information is necessarily lost from the hidden state when processing long sequences. While prior work has studied the sequence length at which this information loss occurs, it does not characterize the types of information SSM language models (LMs) tend to forget. In this paper, we address this knowledge gap by identifying the types of tokens (e.g., parts of speech, named entities) and sequences (e.g., code, math problems) that are more frequently forgotten by SSM LMs. We achieve this by training an auto-encoder to reconstruct sequences from the SSM's hidden state and measuring information loss by comparing inputs with their reconstructions. We perform experiments using the Mamba family of SSM LMs (130M–1.4B) on sequences ranging from 4 to 256 tokens. Our results show significantly higher rates of information loss on math-related tokens (e.g., numbers, variables), mentions of organization entities, and dialects other than Standard American English. We then examine how frequently these tokens appear in Mamba's pretraining data and find that less prevalent tokens tend to be the ones Mamba is most likely to forget. By identifying these patterns, our work provides clear direction for future research to develop methods that better control Mamba's ability to retain important information.
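To make "comparing inputs with their reconstructions" concrete, here is a minimal sketch of two such comparisons: a token-level omission rate and a unigram-overlap F1 in the spirit of ROUGE-1. The function names, the whitespace tokenization, and the multiset matching rule are assumptions made for illustration. The paper's exact metric definitions are not given in this summary and may differ.

```python
from collections import Counter


def omitted_tokens(source, reconstruction):
    """Source tokens with no matching copy in the reconstruction.

    Assumed rule: each reconstruction token can "cover" at most one
    source token (multiset matching), regardless of position.
    """
    remaining = Counter(reconstruction)
    omitted = []
    for tok in source:
        if remaining[tok] > 0:
            remaining[tok] -= 1
        else:
            omitted.append(tok)
    return omitted


def omission_rate(source, reconstruction):
    """Fraction of source tokens missing from the reconstruction."""
    if not source:
        return 0.0
    return len(omitted_tokens(source, reconstruction)) / len(source)


def unigram_f1(source, reconstruction):
    """Unigram-overlap F1 (ROUGE-1-style), without stemming."""
    overlap = sum((Counter(source) & Counter(reconstruction)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(reconstruction)
    recall = overlap / len(source)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the reconstruction drops the number.
source = "Acme paid 42 dollars".split()
recon = "Acme paid dollars".split()

print(omitted_tokens(source, recon))   # ['42']
print(omission_rate(source, recon))    # 0.25
print(round(unigram_f1(source, recon), 3))
```

In this toy case a single dropped number barely moves the sequence-level score but shows up directly in the token-level one, which is why the two granularities are reported side by side.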

Paper Structure

This paper contains 42 sections, 3 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Identifying Selective Memory Loss. We train auto-encoders to reconstruct inputs directly from Mamba’s hidden states, and measure information loss by comparing inputs to their reconstructions. These reconstructions act as probes of the hidden state's information retention: more faithful reconstruction implies greater information retention.
  • Figure 2: Reconstruction Performance Across Sequence Lengths. ROUGE F1-score for reconstructing text using Mamba (130M). Performance declines sharply as sequence length increases.
  • Figure 3: Performance vs. Model Size. ROUGE F1-score as a function of model size, broken down by sequence length. Shorter sequences achieve high performance with fewer parameters, while longer sequences benefit significantly from increased model capacity.
  • Figure 4: Error Positions. We count the number of reconstruction errors per position in a sequence to assess whether token position impacts reconstruction accuracy. We find that later positions have more reconstruction errors than earlier positions. 130M model errors by position are shown here.
  • Figure 5: Part-of-Speech. Omission rates by PoS type on CoNLL-2003. Numbers have the highest omission rate across model sizes.
  • ...and 5 more figures
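The per-category breakdown described in the Figure 5 caption (omission rates by PoS type) can be sketched as a grouped version of the omission count. This is an illustrative sketch only: the tags are supplied by hand here, the function name `omission_rate_by_tag` is hypothetical, and the matching rule (a source token counts as omitted when no unused copy of it remains in the reconstruction) is an assumption, not the paper's stated procedure.

```python
from collections import Counter, defaultdict


def omission_rate_by_tag(tagged_source, reconstruction):
    """Omission rate per tag for a list of (token, tag) pairs.

    Assumed rule: a source token is omitted if no unused copy of it
    remains among the reconstruction tokens.
    """
    remaining = Counter(reconstruction)
    total = defaultdict(int)
    omitted = defaultdict(int)
    for token, tag in tagged_source:
        total[tag] += 1
        if remaining[token] > 0:
            remaining[token] -= 1
        else:
            omitted[tag] += 1
    return {tag: omitted[tag] / total[tag] for tag in total}


# Hypothetical tagged input; in practice the tags would come from an
# annotated corpus or a tagger.
tagged = [
    ("Acme", "PROPN"), ("paid", "VERB"), ("42", "NUM"),
    ("dollars", "NOUN"), ("and", "CCONJ"), ("7", "NUM"),
    ("cents", "NOUN"),
]
recon = "Acme paid 42 dollars and cents".split()

rates = omission_rate_by_tag(tagged, recon)
for tag, rate in sorted(rates.items()):
    print(f"{tag}: {rate:.2f}")
```

Aggregating these per-tag rates over a whole dataset, rather than one sentence, would give a table comparable in shape to the one the figure summarizes.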