Sparse Autoencoders Reveal Temporal Difference Learning in Large Language Models
Can Demircan, Tankred Saanum, Akshay K. Jagadish, Marcel Binz, Eric Schulz
TL;DR
This work investigates how large language models perform reinforcement learning in-context by analyzing Llama 3 70B with Sparse Autoencoders (SAEs) to extract low-dimensional latents from the residual stream that resemble temporal difference (TD) signals. The authors show that representations resembling TD errors, $Q$-values, and successor representations can emerge across transformer blocks even though the model is trained only on next-token prediction, and they establish causal roles for these latents via targeted interventions. Across three tasks (Two-Step, Grid World, and a graph-learning paradigm), the approach reveals both local and global RL-like structure and shows that manipulating TD latents can systematically alter the model's policy and internal representations. The study offers a concrete methodology for the mechanistic analysis of in-context learning and links computational ideas from reinforcement learning to neural representations observed in both artificial and biological systems.
Abstract
In-context learning, the ability to adapt based on a few examples in the input prompt, is a ubiquitous feature of large language models (LLMs). However, as LLMs' in-context learning abilities continue to improve, understanding this phenomenon mechanistically becomes increasingly important. In particular, it is not well understood how LLMs learn to solve specific classes of problems, such as reinforcement learning (RL) problems, in-context. Through three different tasks, we first show that Llama 3 70B can solve simple RL problems in-context. We then analyze the residual stream of Llama using Sparse Autoencoders (SAEs) and find representations that closely match temporal difference (TD) errors. Notably, these representations emerge despite the model only being trained to predict the next token. We verify that these representations are indeed causally involved in the computation of TD errors and $Q$-values by performing carefully designed interventions on them. Taken together, our work establishes a methodology for studying and manipulating in-context learning with SAEs, paving the way for a more mechanistic understanding.
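
For readers unfamiliar with the quantities named above, the following is a minimal sketch of a TD error and $Q$-value update in textbook tabular Q-learning. It is an illustration only, not the analysis pipeline used in this work: the function name `td_update`, the learning rate `alpha`, the discount factor `gamma`, and the toy three-state table are all assumptions made for the example.

```python
def td_update(q, state, action, reward, next_state,
              alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning step in place and return the TD error.

    q is a list of per-state lists of action values (hypothetical layout).
    alpha and gamma are illustrative defaults, not values from the paper.
    """
    # Bootstrapped value of the next state; zero if the episode has ended.
    bootstrap = 0.0 if terminal else gamma * max(q[next_state])
    # TD error: gap between the one-step target and the current estimate.
    td_error = reward + bootstrap - q[state][action]
    # Move the estimate a fraction alpha of the way toward the target.
    q[state][action] += alpha * td_error
    return td_error


# Toy example: 3 states, 2 actions, all Q-values initialised to zero.
q = [[0.0, 0.0] for _ in range(3)]
delta = td_update(q, state=0, action=1, reward=1.0, next_state=2)
```

In this toy run the first reward is entirely unexpected, so the TD error equals the reward and the updated $Q$-value moves a fraction `alpha` of the way toward it. Signals of this kind are what the SAE latents are compared against.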
