Your Context Is Not an Array: Unveiling Random Access Limitations in Transformers

MohammadReza Ebrahimi, Sunny Panchal, Roland Memisevic

TL;DR

The paper investigates why Transformer-based LLMs fail to generalize to longer sequences in algorithmic tasks like parity and addition. It shows that random memory access within the context is crucial for length generalization, and that standard natural-language pretraining favors content-based addressing over true index-based retrieval. Through interleaved scratchpads and mnemonics, the authors demonstrate that length-generalizable algorithms can be learned when a model can effectively locate the correct memory cell, either directly or via anchors, and they provide attention-map evidence to support this claim. Extending the study to multi-digit addition, the work confirms that mnemonic-based indexing can enable correct, length-generalizable arithmetic, highlighting the potential importance of explicit index-based addressing mechanisms or external memory for robust algorithmic reasoning in LLMs. The findings have practical implications for designing architectures and training regimes that better support long-context reasoning and generalization to unseen problem sizes.

Abstract

Despite their recent successes, Transformer-based large language models show surprising failure modes. A well-known example of such failure modes is their inability to length-generalize: solving problem instances at inference time that are longer than those seen during training. In this work, we further explore the root cause of this failure by performing a detailed analysis of model behaviors on the simple parity task. Our analysis suggests that length generalization failures are intricately related to a model's inability to perform random memory accesses within its context window. We present supporting evidence for this hypothesis by demonstrating the effectiveness of methodologies that circumvent the need for indexing or that enable random token access indirectly, through content-based addressing. We further show where and how the failure to perform random memory access manifests through attention map visualizations.
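
The distinction the abstract draws between indexing and content-based addressing can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `(anchor, bit)` layout, and the example anchor words are all hypothetical, and the sketch assumes every anchor (a "mnemonic" in the paper's terminology) is unique within the context.

```python
def tag_with_mnemonics(bits, mnemonics):
    """Pair every bit with an anchor token (hypothetical layout): [(anchor, bit), ...]."""
    if len(bits) != len(mnemonics):
        raise ValueError("need exactly one mnemonic per bit")
    if len(set(mnemonics)) != len(mnemonics):
        raise ValueError("mnemonics must be unique for content lookup to be unambiguous")
    return list(zip(mnemonics, bits))


def read_by_index(context, k):
    """Index-based addressing: fetch whatever sits at position k (0-based)."""
    return context[k][1]


def read_by_content(context, anchor):
    """Content-based addressing: fetch the bit stored next to a matching anchor."""
    for token, bit in context:
        if token == anchor:
            return bit
    raise KeyError(anchor)


context = tag_with_mnemonics([1, 0, 1], ["cat", "dog", "owl"])

# Prepending one extra token shifts every position by one.
shifted = [("pad", 0)] + context
```

In this toy setting `read_by_content(shifted, "owl")` still returns the bit tagged `owl`, whereas `read_by_index(shifted, 2)` now lands on a different cell. The lookup by anchor does not depend on absolute position, which is the property the abstract attributes to content-based addressing.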

Paper Structure (19 sections, 12 figures)

Figures (12)

  • Figure 1: Top: Prediction in natural language tasks. To predict the pronoun him, the model needs to access previously used pronouns in the context, among other tokens, regardless of the exact position of the token He in the context (content-based addressing). Bottom: Prediction in an arithmetic task. The model returns the running parity of the binary sequence after ===. For the third output, the model must precisely attend to the token in position 3 of the context window (index-based addressing).
  • Figure 2: Length generalization performance of fine-tuned BLOOMZ-560M models on sequences of length 10 to 20 bits, using standard and interleaved scratchpad formats, as well as without a scratchpad.
  • Figure 3: Length generalization performance of fine-tuned BLOOMZ-560M models with and without using mnemonics in the scratchpad.
  • Figure 4: BLOOMZ-560M models trained from random initialization on the parity task using twice the number of epochs.
  • Figure 5: Input attribution, visualized with the gradient$\times$input method, while the model performs the parity task. Models were trained on sequences of 10 to 20 bits and are evaluated on a 40-bit sequence, shown with (right) and without (left) mnemonics. Columns represent output tokens (after ===) and rows represent all tokens in the context window. Observe the scrambled attention pattern in the left figure after the 20th output.
  • ...and 7 more figures
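
The running-parity task described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the paper's data pipeline: the function names and the space-separated layout are assumptions, and only the `===` separator is taken from the caption.

```python
from itertools import accumulate
from operator import xor

SEPARATOR = "==="  # separator token as it appears in the Figure 1 caption


def running_parity(bits):
    """Return the parity of every prefix of `bits` (1 means an odd number of ones)."""
    return list(accumulate(bits, xor))


def format_example(bits):
    """Render one example as 'input === running parity' (layout is an assumption)."""
    inputs = " ".join(map(str, bits))
    outputs = " ".join(map(str, running_parity(bits)))
    return f"{inputs} {SEPARATOR} {outputs}"


def next_output(bits, previous_outputs):
    """Compute the next output from the previous output and one input bit."""
    k = len(previous_outputs)  # 0-based index of the output being produced
    carry = previous_outputs[-1] if previous_outputs else 0
    return carry ^ bits[k]     # bits[k] is the index-based read


def generate(bits):
    """Produce all outputs one step at a time, as an autoregressive model would."""
    outputs = []
    for _ in bits:
        outputs.append(next_output(bits, outputs))
    return outputs
```

Each step needs only the previous output and the single input bit at the matching position, so the third output reads `bits[2]`, the token in position 3 of the caption. The per-step computation is the same at every length; what grows with the sequence is the range of positions that must be addressed exactly.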