Emergence of Episodic Memory in Transformers: Characterizing Changes in Temporal Structure of Attention Scores During Training

Deven Mahesh Mistry, Anooshka Bajaj, Yash Aggarwal, Sahaj Singh Maini, Zoran Tiganj

TL;DR

The paper probes how transformer networks organize temporal information during in-context learning by applying cognitive science metrics (lag-CRP analysis, induction matching scores, and controlled ablations) to GPT-2 small and medium trained on the WikiText-103 and FineWeb datasets. It reveals episodic-memory-like temporal biases in attention, including primacy, recency, and contiguity, with contiguity predominantly driven by induction heads that enable in-context sequence recall; these effects weaken when induction heads are ablated. Time constants governing temporal retrieval are typically short, clustering around 2–4 tokens, and the magnitude of positional encodings modulates the strength and shape of these effects. Collectively, the work provides a quantitative, cross-disciplinary view of how temporal context is organized during in-context learning in transformers and highlights the role of induction heads in shaping downstream recall behavior.

Abstract

We investigate in-context temporal biases in attention heads and transformer outputs. Using cognitive science methodologies, we analyze attention scores and outputs of GPT-2 models of varying sizes. Across attention heads, we observe effects characteristic of human episodic memory, including temporal contiguity, primacy, and recency. Transformer outputs demonstrate a tendency toward in-context serial recall. Importantly, this effect is eliminated after the ablation of the induction heads, which are the driving force behind the contiguity effect. Our findings offer insights into how transformers organize information temporally during in-context learning, shedding light on their similarities and differences with human memory and learning.

Paper Structure

This paper contains 15 sections, 2 equations, 20 figures, 6 tables.

Figures (20)

  • Figure 1: Attention scores as a function of lag for all attention heads in GPT-2 small after 4000 iterations on the WikiText-103 dataset (baseline positional encoding).
  • Figure 2: Two induction heads before (top row) and after (middle row) adjusting for the recency effect. The bottom row shows a zoomed-in version of the middle row.
  • Figure 3: Example of the same induction head L7H3 during different stages of training. A. After 300 iterations. B. After 4000 iterations.
  • Figure 4: Induction scores for five checkpoints throughout the training of GPT-2 small on the WikiText-103 dataset. A. Random initialization. B. 1000 iterations. C. 2000 iterations. D. 3000 iterations. E. 4000 iterations.
  • Figure 5: Correlation in positional encoding vectors scales with training iterations and positional encoding magnitude during training of GPT-2 small on the WikiText-103 dataset.
  • ...and 15 more figures