Transformers Can Learn Temporal Difference Methods for In-Context Reinforcement Learning
Jiuqi Wang, Ethan Blaser, Hadi Daneshmand, Shangtong Zhang
TL;DR
This paper investigates in-context reinforcement learning (ICRL) with a focus on policy evaluation, providing both empirical and theoretical evidence that a pretrained transformer can implement temporal-difference (TD) methods in its forward pass. The authors construct an explicit linear-transformer setup that performs batch TD(0) during inference and show that, under TD-based multi-task pretraining, the network weights converge to TD-like structures within an invariant set of the training dynamics. They extend the analysis to TD(λ) and outline how residual gradient and average-reward TD can also be realized in-context, arguing for a broader class of in-context RL capabilities. The work positions TD as an algorithm that emerges naturally from RL pretraining and proposes a path toward a white-box understanding of ICRL, while noting limitations such as the reliance on linear attention and experiments restricted to policy evaluation. Overall, the findings suggest that RL algorithms can be learned and then executed in the forward pass without parameter updates, with potential implications for rapid generalization to unseen tasks.
Abstract
Traditionally, reinforcement learning (RL) agents learn to solve new tasks by updating their neural network parameters through interactions with the task environment. However, recent works demonstrate that some RL agents, after certain pretraining procedures, can learn to solve unseen new tasks without parameter updates, a phenomenon known as in-context reinforcement learning (ICRL). The empirical success of ICRL is widely attributed to the hypothesis that the forward pass of the pretrained agent neural network implements an RL algorithm. In this paper, we support this hypothesis by showing, both empirically and theoretically, that when a transformer is trained for policy evaluation tasks, it can discover and learn to implement temporal difference learning in its forward pass.
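For readers unfamiliar with the algorithm named above, the following is a minimal sketch of batch TD(0) for policy evaluation with linear value features, written as an ordinary loop rather than as a transformer forward pass. It is an illustration under stated assumptions, not the authors' construction: the function name `batch_td0`, the step size, the iteration count, and the two-state toy chain are all hypothetical choices made here. Each iteration averages the TD error times the feature vector over a fixed batch of transitions, which is the kind of update the paper argues a transformer can carry out over its context.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def batch_td0(transitions, dim, alpha=0.5, gamma=0.9, num_iters=2000):
    """Batch TD(0) with a linear value estimate v(s) = w . phi(s).

    transitions: list of (phi, reward, phi_next) tuples, where phi and
    phi_next are feature vectors of length dim.
    alpha, gamma, num_iters are illustrative defaults (assumptions).
    """
    w = [0.0] * dim  # start from the zero weight vector
    n = len(transitions)
    for _ in range(num_iters):
        update = [0.0] * dim
        for phi, reward, phi_next in transitions:
            # TD error: bootstrapped target minus the current estimate
            td_error = reward + gamma * dot(w, phi_next) - dot(w, phi)
            for i in range(dim):
                update[i] += td_error * phi[i]
        # One batch step: average the accumulated update over the batch
        w = [w_i + alpha * u_i / n for w_i, u_i in zip(w, update)]
    return w


# Hypothetical toy chain with one-hot features:
# state 0 moves to state 1 with reward 0; state 1 loops on itself with reward 1.
transitions = [
    ([1.0, 0.0], 0.0, [0.0, 1.0]),
    ([0.0, 1.0], 1.0, [0.0, 1.0]),
]
w = batch_td0(transitions, dim=2)
print(w)
```

In this toy chain the true values under a discount of 0.9 are v(1) = 1 / (1 - 0.9) = 10 and v(0) = 0.9 × 10 = 9, and the learned weights approach those numbers. No task-specific parameters are trained here beyond `w` itself; in the in-context setting described in the abstract, the analogous iterations would happen inside the forward pass while the network's own weights stay fixed.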
