Linking In-context Learning in Transformers to Human Episodic Memory

Li Ji-An; Corey Y. Zhou; Marcus K. Benna; Marcelo G. Mattar

TL;DR

The relationship between interacting attention heads and human episodic memory is examined, demonstrating that induction heads are behaviorally, functionally, and mechanistically similar to the contextual maintenance and retrieval model of human episodic memory.

Abstract

Understanding connections between artificial and biological intelligent systems can reveal fundamental principles of general intelligence. While many artificial intelligence models have a neuroscience counterpart, such connections are largely missing in Transformer models and the self-attention mechanism. Here, we examine the relationship between interacting attention heads and human episodic memory. We focus on induction heads, which contribute to in-context learning in Transformer-based large language models (LLMs). We demonstrate that induction heads are behaviorally, functionally, and mechanistically similar to the contextual maintenance and retrieval (CMR) model of human episodic memory. Our analyses of LLMs pre-trained on extensive text data show that CMR-like heads often emerge in the intermediate and late layers, qualitatively mirroring human memory biases. The ablation of CMR-like heads suggests their causal role in in-context learning. Our findings uncover a parallel between the computational mechanisms of LLMs and human memory, offering valuable insights into both research fields.

Paper Structure (30 sections, 20 equations, 14 figures, 2 tables)

Introduction
Next-token prediction and memory recall
Transformer models and induction heads
Residual stream and interacting heads
Induction heads and their attention patterns
K-composition and Q-composition induction heads
Contextual maintenance and retrieval model (CMR)
CMR in its original form
CMR as an induction head
Experiments
Quantifying the similarity between an induction head and CMR
CMR-like heads develop human-like temporal clustering over training
CMR-like heads are causally relevant for ICL capability
Discussion
Code
...and 15 more sections

Figures (14)

Figure 1: Tasks and model architectures. (a) Next-token prediction task. The ICL of pre-trained LLMs is evaluated on a sequence of repeated random tokens ("… [A][B][C][D] … [A][B][C][D] …"; e.g., [A]=light, [B]=cat, [C]=table, [D]=water) by predicting the next token (e.g., "… [A][B][C][D] … [B]"$\rightarrow$ ?). (b) Human memory recall task. During the study phase, the subject is sequentially presented with a list of words to memorize. During the recall phase, the subject is required to recall the studied words in any order. (c) Transformer architecture, centering on the residual stream. The blue path is the residual stream of the current token, and the grey path represents the residual stream of a past token. $H_1$ and $H_2$ are attention heads. MLP is the multilayer perceptron. (d) Contextual maintenance and retrieval model. The word vector $\mathbf{f}$ is retrieved from the context vector $\mathbf{t}$ via $\mathbf{M}^{\rm TF}$ and the context vector is updated by the word vector via $\mathbf{M}^{\rm FT}$ (see main text for details).
Figure 2: Induction heads in the GPT2-small model. (a) Several heads in GPT2 have a relatively large induction-head matching score. (b) The attention pattern of the L5H1 head, which has the largest induction-head matching score. The diagonal line ("induction stripe") shows the attention from the destination token in the second repeat to the source token in the first repeat. (c) The attention scores of the L5H1 head averaged over all tokens in the designed prompt as a function of the relative position lag (similar to CRP). Error bars show the SEM across tokens.
Figure 3: Comparison of composition mechanisms of induction heads and CMR. All panels correspond to the optimal Q-K match condition ($j=i-1$). See the main text and the comparison table for details. (a) K-composition induction head. The first-layer head's output serves as the Key of the second-layer head. (b) Q-composition induction head. The first-layer head's output serves as the Query of the second-layer head. (c) CMR is similar to a Q-composition induction head, except that the context vector $\mathbf{t}_{j-1}$ is first updated by $\mathbf{M}^{\rm FT}$ into $\mathbf{t}_{j}$ at position $j$, then directly used at position $j+1$ (equal to $i$ for the optimal match condition; shown by red lines).
Figure 4: The conditional response probability (CRP) as a function of position lag in a human experiment and under different parameterizations of CMR. (a) CRP of participants (N=171) in the PEERS dataset, reproduced from [zhang2023]. "Top 10%" refers to participants whose performance was in the top 10th percentile of the population when recall started from the beginning of the list. They have a sharper CRP with a larger forward asymmetry than other subjects. (b) Left, CMR with "sequential chaining" behavior ($\beta_{\rm enc}=\beta_{\rm rec}=1, \gamma_{\rm FT}=0$). Recall follows exactly the same order as the study phase, without skipping any word. Right, CMR with moderate updating at both encoding and retrieval, resulting in human-like free recall behavior ($\beta_{\rm enc}=\beta_{\rm rec}=0.7, \gamma_{\rm FT}=0$). Recall is more likely than not to follow the study order and sometimes skips words. (c) Same as (b, Right) except with $\gamma_{\rm FT}=0.5$ (Left) and $\gamma_{\rm FT}=1$ (Right). For more examples, see the supplementary CMR CRP figure.
Figure 5: CMR distance provides meaningful descriptions for attention heads in GPT2. (a-c) Average attention scores and the CMR-fitted attention scores of example induction heads (with a non-zero induction-head matching score and positive copying score). (d) Average attention scores and the CMR-fitted attention scores of a duplicate token head [wang2022interpretability] that is traditionally not considered an induction head but can be well-captured by the CMR. (e) (Top) CMR distance (measured by MSE) and the induction-head matching score for each head. (Bottom) Histogram of the CMR distance.
...and 9 more figures
