Language Models use Lookbacks to Track Beliefs
Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger
TL;DR
The paper studies how language models (LMs) represent and update characters' beliefs, and introduces the CausalToM dataset to support counterfactual causal analysis of Theory of Mind in LMs. It identifies a recurring lookback mechanism with three instances. The LM binds each character-object-state triple by co-locating their Ordering IDs in the residual stream of the state token. A binding lookback then retrieves the correct state Ordering ID, an answer lookback dereferences it to the state token, and a visibility lookback uses a visibility ID to fold the observed character's information into the observing character's beliefs, with retrieval carried out through a QK-circuit. Using interchange interventions and causal abstraction, the authors localize these components to specific subspaces and layers of the LM. The findings generalize across multiple LMs and relate to the BigToM benchmark, taking a step toward reverse-engineering belief reasoning in transformers.
Abstract
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze LMs' ability to reason about characters' beliefs using causal mediation and abstraction. We construct a dataset, CausalToM, consisting of simple stories where two characters independently change the state of two objects, potentially unaware of each other's actions. Our investigation uncovers a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating their reference information, represented as Ordering IDs (OIs), in low-rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the correct state OI and then the answer lookback retrieves the corresponding state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.
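The two-step retrieval described above (binding lookback, then answer lookback) can be illustrated with a toy sketch. This is not the authors' implementation or the LM's actual computation: the one-hot Ordering IDs, the `lookback` and `answer_question` helpers, the `sharpness` parameter, and the example story tokens are all assumptions made for illustration. Each lookback is modeled as a single attention step in which a pointer is matched against addresses by dot product and the matching payload is returned.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical one-hot Ordering IDs (OIs): "first" vs. "second" entity of a kind.
OI = {1: [1.0, 0.0], 2: [0.0, 1.0]}

def lookback(pointer, addresses, payloads, sharpness=10.0):
    """One attention step: match the pointer against addresses, return the payload mix."""
    weights = softmax([sharpness * dot(pointer, a) for a in addresses])
    dim = len(payloads[0])
    return [sum(w * p[i] for w, p in zip(weights, payloads)) for i in range(dim)]

# Toy story (assumed): character 1 puts "beer" in object 1,
# character 2 puts "coffee" in object 2.
state_tokens = ["beer", "coffee"]

# Binding: at each state token, the character OI and object OI are co-located
# (the address), alongside that state's own OI (the payload).
binding_addresses = [OI[1] + OI[1], OI[2] + OI[2]]
binding_payloads = [OI[1], OI[2]]

# Answer: the state OI is the address, the token identity (one-hot) is the payload.
answer_addresses = [OI[1], OI[2]]
answer_payloads = [[1.0, 0.0], [0.0, 1.0]]

def answer_question(char_idx, obj_idx):
    # Binding lookback: the queried character and object OIs retrieve the state OI.
    state_oi = lookback(OI[char_idx] + OI[obj_idx], binding_addresses, binding_payloads)
    # Answer lookback: the retrieved state OI is dereferenced to the state token.
    token_scores = lookback(state_oi, answer_addresses, answer_payloads)
    return state_tokens[token_scores.index(max(token_scores))]

print(answer_question(2, 2))
```

The sketch only covers questions where the queried character acted on the queried object. A mismatched pair (for example, character 1 and object 2) produces a tie in this toy and is not modeled here, nor is the visibility lookback.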
