Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context

Yoav Gur-Arieh, Mor Geva, Atticus Geiger

TL;DR

This study establishes a more complete picture of how LMs bind and retrieve entities in-context and develops a causal model combining all three mechanisms that estimates next token distributions with 95% agreement.

Abstract

A key component of in-context reasoning is the ability of language models (LMs) to bind entities for later retrieval. For example, an LM might represent "Ann loves pie" by binding "Ann" to "pie", allowing it to later retrieve "Ann" when asked "Who loves pie?" Prior research on short lists of bound entities found strong evidence that LMs implement such retrieval via a positional mechanism, where "Ann" is retrieved based on its position in context. In this work, we find that this mechanism generalizes poorly to more complex settings; as the number of bound entities in context increases, the positional mechanism becomes noisy and unreliable in middle positions. To compensate for this, we find that LMs supplement the positional mechanism with a lexical mechanism (retrieving "Ann" using its bound counterpart "pie") and a reflexive mechanism (retrieving "Ann" through a direct pointer). Through extensive experiments on nine models and ten binding tasks, we uncover a consistent pattern in how LMs mix these mechanisms to drive model behavior. We leverage these insights to develop a causal model combining all three mechanisms that estimates next token distributions with 95% agreement. Finally, we show that our model generalizes to substantially longer inputs of open-ended text interleaved with entity groups, further demonstrating the robustness of our findings in more natural settings. Overall, our study establishes a more complete picture of how LMs bind and retrieve entities in-context.
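
The three retrieval routes named in the abstract can be pictured with a toy lookup over entity groups. This is only an illustrative sketch, not the paper's method: the `Group` class, the three function names, and the example entities other than "Ann" and "pie" are hypothetical, and a real LM encodes these signals in activations rather than in explicit lookups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    """One bound entity group, e.g. "Ann loves pie" (hypothetical container)."""
    subject: str  # entity to retrieve, e.g. "Ann"
    obj: str      # bound counterpart, e.g. "pie"


def positional(groups, index):
    """Retrieve the subject of the group at a given position in context."""
    return groups[index].subject


def lexical(groups, obj):
    """Retrieve the subject via its bound counterpart (e.g. "pie")."""
    for g in groups:
        if g.obj == obj:
            return g.subject
    return None


def reflexive(groups, subject):
    """Retrieve the subject through a direct pointer to the subject itself."""
    for g in groups:
        if g.subject == subject:
            return g.subject
    return None


context = [Group("Ann", "pie"), Group("Bob", "cake"), Group("Cat", "tea")]

# Clean query "Who loves pie?": all three keys point at the first group.
clean = (positional(context, 0), lexical(context, "pie"), reflexive(context, "Ann"))

# Assumed counterfactual-style keys: each signal points at a different group,
# so the three routes disagree and their contributions can be told apart.
mixed = (positional(context, 1), lexical(context, "tea"), reflexive(context, "Ann"))

print(clean)
print(mixed)
```

On the clean query the three routes agree, which is why they cannot be distinguished without an intervention; once the keys point at different groups, each route names a different entity.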

Paper Structure

This paper contains 38 sections, 4 equations, 28 figures, 6 tables.

Figures (28)

  • Figure 1: An illustration of the three mechanisms for retrieving bound entities in-context. We find that as models process inputs with groups of entities: (A) binding information of three types (positional, lexical, reflexive) is encoded in the entity tokens of each group, (B) this binding information is jointly used to retrieve entities in-context, and (C) it is possible to separate the three binding signals with counterfactual patching. The counterfactual input is designed such that patching activations to the LM run on the original input results in the positional, lexical, and reflexive mechanisms predicting different entities (see §\ref{sec:counterfactual}).
  • Figure 2: Results from interchange interventions on gemma-2-2b-it over a counterfactual dataset with three entities per group ($m=3$) (see Figure \ref{fig:intro} and §\ref{sec:counterfactual}). Outputs predicted by the positional, lexical, and reflexive mechanisms are shown in dark blue, green, and orange, respectively. Left: Distribution of effects for three representative entity group indices (first, middle, and last) with $t_{\text{entity}}=3$. At layers 16–18, the last token position carries binding information used for retrieval. Right: Distribution of effects for all indices at layer 18 for $t_{\text{entity}} \in \{1,2,3\}$, i.e., the question can be about any of the three entities in each clause. A U-shaped curve emerges: first and last indices rely more on the positional mechanism, while middle indices rely more on the lexical and reflexive mechanisms. See §\ref{appx:more_exps} for replication across models and tasks, and Figure \ref{fig:u_plot_other_axes} for plots using the original prompt as the x-axis.
  • Figure 3: The positional mechanism is diffuse for middle entity groups. Left: Confusion matrix (%) of the patched positional index vs. gemma-2-2b-it’s prediction after an interchange intervention (as in Figure \ref{fig:intro}). Counterfactual predictions cluster near the position promoted by the positional mechanism, decaying with distance. Only the mixed and positional patch effects from Figure \ref{fig:u_plot} are shown; see Figure \ref{fig:gen_conf} for other models and tasks. Right: Mean logit distributions with $i_P=6$, $i_R=14$, and $i_L$ varied, illustrating the interaction between the three mechanisms. The lexical and reflexive signals form one-hot peaks, while the positional signal is broader and more diffuse. The mechanisms also show additive and suppressive effects on one another. See Figures \ref{fig:more_dists}, \ref{fig:more_dists_qwen1}, and \ref{fig:more_dists_qwen2} for more distributions.
  • Figure 4: Results for training our full model $\mathcal{M}(L_{\text{one-hot}}, R_{\text{one-hot}}, P_{\text{Gauss}})$, in addition to variants, baselines, and ablations. Left: JSS scores for modeling the LM next token distribution over $i_P, i_L, i_R$. Evaluated on gemma-2-2b-it for the music binding task, with $t_e=t_{\text{entity}}$. Our model attains near-perfect JSS, slightly below the oracle. KL values (Table \ref{tab:modeling_results_kl}) show the same trend. All CIs are $<0.02$; for $\mathcal{M}$ and $\mathcal{M}$ w/ oracle they are $<0.002$. Right: Learned weights $w_{\text{lex}}, w_{\text{ref}}, w_{\text{pos}}$ and the $\sigma$ curve, for $t_{\text{entity}}=2$. Note that $\sigma$ widens for middle indices and narrows toward the end.
  • Figure 5: Padding results for gemma-2-2b-it on the boxes task. Left: Confusion matrix between the model's predicted index and the positional index patched in from the counterfactual. The matrix becomes increasingly fuzzy for early tokens as padding is increased. Right: Distribution of effects as padding is increased, showing that the positional mechanism strengthens at the expense of the lexical mechanism.
  • ...and 23 more figures
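
The captions for Figures 3 and 4 describe one-hot lexical and reflexive signals combined with a broader, Gaussian-shaped positional signal, weighted by $w_{\text{lex}}$, $w_{\text{ref}}$, $w_{\text{pos}}$ and a width $\sigma$. A minimal sketch of one way such a mixture could be computed follows. Several things here are assumptions rather than details from the paper: the signals are summed in logit space and passed through a softmax, the weights and $\sigma$ are fixed constants instead of learned per-index values, and JSS is read as one minus the base-2 Jensen-Shannon divergence. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_distribution(n, i_pos, i_lex, i_ref, w_pos, w_lex, w_ref, sigma):
    """Distribution over n entity-group indices from three mixed signals.

    Assumed form: one-hot bumps at i_lex and i_ref plus an unnormalized
    Gaussian centered at i_pos, summed as logits.
    """
    logits = []
    for i in range(n):
        lex = w_lex if i == i_lex else 0.0
        ref = w_ref if i == i_ref else 0.0
        pos = w_pos * math.exp(-0.5 * ((i - i_pos) / sigma) ** 2)
        logits.append(lex + ref + pos)
    return softmax(logits)


def js_similarity(p, q):
    """One minus the base-2 Jensen-Shannon divergence (assumed JSS definition)."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 1.0 - 0.5 * (kl(p, mid) + kl(q, mid))


# Index setting borrowed from the Figure 3 caption (i_P=6, i_R=14);
# i_lex, the weights, and sigma are illustrative values only.
p = mixed_distribution(n=20, i_pos=6, i_lex=10, i_ref=14,
                       w_pos=2.0, w_lex=3.0, w_ref=3.0, sigma=1.5)
q = mixed_distribution(n=20, i_pos=6, i_lex=10, i_ref=14,
                       w_pos=2.0, w_lex=3.0, w_ref=3.0, sigma=4.0)
print(js_similarity(p, q))
```

Widening `sigma` spreads the positional mass over neighboring indices while the two one-hot peaks stay sharp, which mirrors the qualitative contrast the Figure 3 caption draws between the diffuse positional signal and the lexical and reflexive peaks.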