Towards Provable Emergence of In-Context Reinforcement Learning

Jiuqi Wang, Rohan Chandra, Shangtong Zhang

TL;DR

This work tackles why reinforcement learning pretraining yields parameters that enable in-context reinforcement learning (ICRL). It proves that for policy evaluation, a Transformer pretrained with a TD objective yields forward-pass dynamics that converge to the true value functions as depth grows, effectively performing in-context TD learning. Moreover, the converged parameters are global minimizers of a Norm of Expected Updates (NEU) loss for both multi-task TD and Monte Carlo pretraining, and experiments with Boyan's chain show context-length–dependent reductions in MSVE and convergence of the learned weights to the TD-optimal configuration. The results suggest that ICRL can emerge from standard RL pretraining without changing parameters during inference, with potential implications for robust, context-driven generalization across tasks. However, the analysis relies on a linear attention model and simplified prompts, indicating directions for extending to non-linear transformers and more realistic settings.

Abstract

Typically, a modern reinforcement learning (RL) agent solves a task by updating its neural network parameters to adapt its policy to the task. Recently, it has been observed that some RL agents can solve a wide range of new out-of-distribution tasks without parameter updates after pretraining on some task distribution. When evaluated in a new task, instead of making parameter updates, the pretrained agent conditions its policy on additional input called the context, e.g., the agent's interaction history in the new task. The agent's performance increases as the information in the context increases, with the agent's parameters fixed. This phenomenon is typically called in-context RL (ICRL). The pretrained parameters of the agent network enable the remarkable ICRL phenomenon. However, many ICRL works perform the pretraining with standard RL algorithms. This raises the central question this paper aims to address: Why can the RL pretraining algorithm generate network parameters that enable ICRL? We hypothesize that the parameters capable of ICRL are minimizers of the pretraining loss. This work provides initial support for this hypothesis through a case study. In particular, we prove that when a Transformer is pretrained for policy evaluation, one of the global minimizers of the pretraining loss can enable in-context temporal difference learning.

Paper Structure

This paper contains 30 sections, 7 theorems, 79 equations, 10 figures, 1 table, 4 algorithms.

Key Result

Theorem 1

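Given a query state $s_q \in \mathcal{S}$, with $Z_0$ constructed as in the paper's redefinition of $Z_0$ (equation label: Z0 redef), it holds that $\lim_{L\to\infty} \text{TF}_L\left(Z_0;\theta_L^\text{TD}\right) = v(s_q)$, where $L$ denotes the number of Transformer layers.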
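
The intuition behind this limit can be illustrated numerically. The sketch below is not the paper's Transformer construction: it assumes a small hypothetical 4-state Markov reward process (the matrix `P`, rewards `r`, discount `gamma`, and step size `alpha` are all made up for illustration), a tabular value estimate, and treats each "layer" as one expected TD(0) sweep. Under these assumptions, the estimate approaches the true value function as the number of sweeps grows, which mirrors the role of depth $L$ in the theorem.

```python
import numpy as np


def true_values(P, r, gamma):
    """Solve the Bellman equation v = r + gamma * P v exactly."""
    return np.linalg.solve(np.eye(len(r)) - gamma * P, r)


def td_sweeps(P, r, gamma, num_layers, alpha=0.5):
    """Apply `num_layers` expected TD(0) sweeps to a tabular estimate.

    Each sweep stands in for one layer of the forward pass (an assumption
    made for illustration only); the estimate starts at zero.
    """
    w = np.zeros(len(r))
    for _ in range(num_layers):
        td_error = r + gamma * (P @ w) - w  # expected TD error per state
        w = w + alpha * td_error
    return w


# Hypothetical 4-state chain, not taken from the paper.
P = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
])
r = np.array([-3.0, -3.0, -2.0, 0.0])
gamma = 0.9

v = true_values(P, r, gamma)
# Max-norm error of the estimate after L sweeps, for increasing L.
errors = {
    L: float(np.max(np.abs(td_sweeps(P, r, gamma, L) - v)))
    for L in (1, 10, 100, 1000)
}
for L, err in errors.items():
    print(f"L={L:5d}  max |estimate - v| = {err:.3e}")
```

With `alpha` in $(0, 1]$, each sweep contracts the max-norm error by a factor of at most $(1-\alpha) + \alpha\gamma$, so the printed errors shrink as `L` increases. The paper's result concerns a Transformer operating on a prompt built from context transitions, so this is only an analogy for why deeper forward passes can get closer to $v(s_q)$.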

Figures (10)

  • Figure 1: Mean and standard error of the averaged MSVEs against context lengths. The curves are averaged over 20 random trials. The shaded areas represent the standard errors.
  • Figure 2: Mean Transformer parameters after pretraining. The parameters are averaged over 20 trials and normalized to stay in the range $[-1, 1]$.
  • Figure 3: Boyan's chain with $S$ states. Arrows indicate non-zero transition probabilities.
  • Figure 4: Mean parameters of 5-, 10-, and 60-layer Transformers after pretraining. Parameters are averaged over 20 trials and normalized to lie within the range $[-1, 1]$.
  • Figure 5: Mean and standard error of the averaged MSVEs against context lengths for 5-, 10-, and 60-layer Transformers. The curves are averaged over 20 random trials. The shaded areas represent the standard errors.
  • ...and 5 more figures

Theorems & Definitions (14)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Corollary 1
  • Lemma 4
  • proof
  • proof
  • proof
  • ...and 4 more