Towards Provable Emergence of In-Context Reinforcement Learning
Jiuqi Wang, Rohan Chandra, Shangtong Zhang
TL;DR
This work asks why reinforcement learning (RL) pretraining yields parameters that enable in-context reinforcement learning (ICRL). For policy evaluation, it proves that a Transformer pretrained with a temporal difference (TD) objective has forward-pass dynamics that converge to the true value functions as depth grows, so the forward pass effectively performs in-context TD learning. Moreover, the converged parameters are global minimizers of a Norm of Expected Updates (NEU) loss for both multi-task TD and Monte Carlo pretraining. Experiments on Boyan's chain show reductions in mean squared value error (MSVE) that depend on context length, and convergence of the learned weights to the TD-optimal configuration. The results suggest that ICRL can emerge from standard RL pretraining without parameter updates at inference time, with potential implications for robust, context-driven generalization across tasks. However, the analysis relies on a linear attention model and simplified prompts, which leaves extensions to non-linear Transformers and more realistic settings as future work.
Abstract
Typically, a modern reinforcement learning (RL) agent solves a task by updating its neural network parameters to adapt its policy to the task. Recently, it has been observed that some RL agents can solve a wide range of new out-of-distribution tasks without parameter updates after pretraining on some task distribution. When evaluated in a new task, instead of making parameter updates, the pretrained agent conditions its policy on additional input called the context, e.g., the agent's interaction history in the new task. The agent's performance increases as the information in the context increases, with the agent's parameters fixed. This phenomenon is typically called in-context RL (ICRL). The pretrained parameters of the agent network enable the remarkable ICRL phenomenon. However, many ICRL works perform the pretraining with standard RL algorithms. This raises the central question this paper aims to address: Why can the RL pretraining algorithm generate network parameters that enable ICRL? We hypothesize that the parameters capable of ICRL are minimizers of the pretraining loss. This work provides initial support for this hypothesis through a case study. In particular, we prove that when a Transformer is pretrained for policy evaluation, one of the global minimizers of the pretraining loss can enable in-context temporal difference learning.
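To make "in-context temporal difference learning" concrete, the following is a minimal sketch of batch TD(0) policy evaluation over a fixed context of transitions. It is not the paper's Transformer construction. It only illustrates the kind of iteration a forward pass would have to emulate, with one sweep loosely playing the role of one layer. The deterministic five-state chain, the tabular value table, the uniform weighting in the error, and all function names are assumptions made for illustration.

```python
def batch_td0(context, num_states, gamma=0.9, alpha=1.0, sweeps=100):
    """Apply `sweeps` batch TD(0) updates using only the transitions in `context`.

    `context` is a list of (s, r, s_next) tuples; s_next is None when the
    transition ends the episode. Returns the value table after every sweep,
    starting with the all-zero initialization.
    """
    v = [0.0] * num_states
    history = [v[:]]
    for _ in range(sweeps):
        delta = [0.0] * num_states
        for s, r, s_next in context:
            bootstrap = 0.0 if s_next is None else v[s_next]
            delta[s] += r + gamma * bootstrap - v[s]  # TD error of this transition
        # Average the updates over the context, then take one step.
        v = [v[s] + alpha * delta[s] / len(context) for s in range(num_states)]
        history.append(v[:])
    return history


def chain_true_values(num_states, gamma):
    """Exact values for a chain that moves s -> s + 1 with reward -1 per step."""
    v = [0.0] * num_states
    for s in reversed(range(num_states)):
        next_value = v[s + 1] if s + 1 < num_states else 0.0
        v[s] = -1.0 + gamma * next_value
    return v


def msve(estimate, truth):
    """Mean squared value error with uniform state weighting (an assumption)."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)


# Hypothetical context: one pass through a 5-state chain under a fixed policy.
NUM_STATES, GAMMA = 5, 0.9
context = [(s, -1.0, s + 1 if s + 1 < NUM_STATES else None) for s in range(NUM_STATES)]
truth = chain_true_values(NUM_STATES, GAMMA)
history = batch_td0(context, NUM_STATES, gamma=GAMMA)

for depth in (0, 10, 100):
    print(f"sweeps={depth:3d}  MSVE={msve(history[depth], truth):.3e}")
```

No parameters of a network are trained here: the value estimate improves purely by iterating over the context, which is the behavior the abstract attributes to the pretrained Transformer's forward pass.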
