Language Models use Lookbacks to Track Beliefs

Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger

TL;DR

The paper probes how language models represent and update characters' beliefs, introducing the CausalToM dataset to enable counterfactual causal analysis of Theory of Mind in LMs. It uncovers a robust Lookback mechanism comprising Binding, Answer, and Visibility lookbacks, which bind character-object-state triples, dereference state pointers, and incorporate observed actions into beliefs via a QK-circuit in residual streams. Through interchange interventions and causal abstraction methods, the authors map these components to specific LM subspaces and layers, providing mechanistic evidence for belief tracking that extends to visibility scenarios. The findings generalize across multiple LMs and relate to the BigToM benchmark, advancing our understanding of the internal computations enabling nontrivial belief reasoning in transformers and guiding future interpretability and safety research.

Abstract

How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze LMs' ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset, CausalToM, consisting of simple stories where two characters independently change the state of two objects, potentially unaware of each other's actions. Our investigation uncovers a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating their reference information, represented as Ordering IDs (OIs), in low-rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the correct state OI and then the answer lookback retrieves the corresponding state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.
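The two-step retrieval described in the abstract can be sketched as a small symbolic model. This is only an illustrative abstraction, not the authors' code or the LM's actual computation: the names `StateToken`, `lookback`, and `answer_belief` are hypothetical, Ordering IDs are modeled as plain integers, and returning "unknown" when no address matches is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class StateToken:
    """Hypothetical stand-in for the information co-located at a state token."""
    char_oi: int   # address part: Ordering ID of the character
    obj_oi: int    # address part: Ordering ID of the object
    state_oi: int  # payload of the binding lookback, pointer of the answer lookback
    token: str     # payload of the answer lookback


def lookback(pointer, entries, address_of, payload_of):
    """Return the payload of the first entry whose address equals the pointer."""
    for entry in entries:
        if address_of(entry) == pointer:
            return payload_of(entry)
    return None  # no address matched the pointer


def answer_belief(story, question):
    """story: list of (character, object, state); question: (character, object)."""
    chars = [c for c, _, _ in story]
    objs = [o for _, o, _ in story]
    # The i-th sentence binds the i-th character, object, and state together.
    tokens = [StateToken(i, i, i, s) for i, (_, _, s) in enumerate(story)]

    q_char, q_obj = question
    if q_char not in chars or q_obj not in objs:
        return "unknown"  # assumption: sentinel for an unanswerable question

    # Binding lookback: the (character OI, object OI) pointer retrieves a state OI.
    pointer = (chars.index(q_char), objs.index(q_obj))
    state_oi = lookback(pointer, tokens,
                        lambda t: (t.char_oi, t.obj_oi),
                        lambda t: t.state_oi)
    if state_oi is None:
        return "unknown"

    # Answer lookback: the state OI is dereferenced to the state token itself.
    return lookback(state_oi, tokens, lambda t: t.state_oi, lambda t: t.token)


story = [("Bob", "cup", "coffee"), ("Carla", "bottle", "beer")]
print(answer_belief(story, ("Bob", "cup")))
print(answer_belief(story, ("Bob", "bottle")))
```

In this sketch the same `lookback` helper serves both steps; only the choice of pointer, address, and payload changes, which mirrors the claim that a single algorithmic pattern is reused.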

Paper Structure

This paper contains 28 sections, 10 equations, 33 figures, 1 algorithm.

Figures (33)

  • Figure 1: The lookback mechanism performs conditional reasoning. The source token contains reference information that is copied into two instances, creating a pointer and an address. Next to the address in the residual stream is a payload. When necessary, the model retrieves the payload by dereferencing the pointer. Solid lines represent information flow, while the dotted line indicates the attention "looking back" from pointer to address.
  • Figure 2: Tracing information flow of crucial input tokens using causal mediation analysis.
  • Figure 3: Belief Tracking with no visibility between characters. We hypothesize that the LM tracks beliefs using two lookback mechanisms. First, in (i) Binding lookback, the LM binds together each character-object-state triple in the state token residual stream. When asked about a specific character-object pair, the LM looks back to the corresponding OIs to retrieve the correct state OI. Second, in (ii) Answer lookback, the LM dereferences that state OI (used as a pointer) to retrieve the token value of the correct state. Colors indicate information type, and shapes indicate the role of information in the lookback (see Fig. 1), e.g., the state OI is a payload in (i) and a pointer-address in (ii).
  • Figure 4: Answer Lookback Pointer and Payload: The causal model predicts that if we alter the "Answer Payload" of the original to instead take the value of the counterfactual answer payload, the output should change from coffee to tea; the gray curve in the line plot shows this does occur when patching residual vectors at the ":" token beyond layer $56$, providing evidence that the answer payload resides in those states. On the other hand, the causal model predicts that taking the counterfactual "Answer Pointer" would change the original run output from coffee to beer (a new output that matches neither the original nor the counterfactual!), and we do see this surprising effect, again when patching layers between $34$ and $52$, providing strong evidence that the answer pointer is encoded at those layers. These results suggest the Answer Lookback occurs between layers $52$ and $56$.
  • Figure 5: Binding Lookback Address and Payload: The causal model predicts that swapping addresses (character and object OIs) and payloads (state OIs) should cause the binding lookback mechanism to attend to the alternate state token and retrieve its state OI. This retrieved state OI is then dereferenced by the answer lookback, producing the corresponding token as the output (e.g., beer instead of coffee). The LM’s behavior matches this prediction when we perform interchange interventions on the state token across layers 33–38. These findings support our hypothesis that both address and payload information are encoded in the residual stream of state tokens.
  • ...and 28 more figures
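
The prediction in the Figure 4 caption can be reproduced on a toy causal model. The sketch below is a hedged illustration of an interchange intervention at the level of the high-level causal model only; it does not patch any real LM activations. The functions `run` and `interchange`, the variable names `answer_pointer` and `answer_payload`, and the example stories are all hypothetical.

```python
def run(story, question, patch=None):
    """Toy causal model with two intervenable variables.

    story: list of (character, object, state); question: (character, object).
    patch: optional dict overriding 'answer_pointer' and/or 'answer_payload'.
    Assumes the question matches one character-object pair in the story.
    """
    patch = patch or {}
    states = [s for _, _, s in story]

    # Binding lookback result: the Ordering ID (position) of the matching state.
    answer_pointer = next(
        i for i, (c, o, _) in enumerate(story) if (c, o) == tuple(question)
    )
    answer_pointer = patch.get("answer_pointer", answer_pointer)

    # Answer lookback result: the state token that the pointer dereferences to.
    answer_payload = states[answer_pointer]
    answer_payload = patch.get("answer_payload", answer_payload)

    return {
        "answer_pointer": answer_pointer,
        "answer_payload": answer_payload,
        "output": answer_payload,
    }


def interchange(original, counterfactual, variable):
    """Copy one variable's value from the counterfactual run into the original run."""
    value = run(*counterfactual)[variable]
    return run(*original, patch={variable: value})["output"]


original = ([("Bob", "cup", "coffee"), ("Carla", "bottle", "beer")], ("Bob", "cup"))
counterfactual = ([("Dan", "glass", "wine"), ("Eve", "mug", "tea")], ("Eve", "mug"))

print(run(*original)["output"])                               # coffee
print(interchange(original, counterfactual, "answer_payload"))  # tea
print(interchange(original, counterfactual, "answer_pointer"))  # beer
```

Patching the payload carries the counterfactual token itself into the original run, while patching the pointer only carries the position, which is then dereferenced against the original story. That is why the pointer intervention yields an output that appears in neither run's unpatched answer.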