Emergence of In-Context Reinforcement Learning from Noise Distillation

Ilya Zisman, Vladislav Kurenkov, Alexander Nikulin, Viacheslav Sinii, Sergey Kolesnikov

TL;DR

This work tackles the data bottleneck in in-context Reinforcement Learning by introducing AD$^\varepsilon$, a noise-distillation curriculum that generates learning histories without requiring thousands of RL agents or access to an optimal policy. By gradually injecting noise into demonstrator policies, the method yields synthetic trajectories that encode progressive policy improvement, enabling a Transformer to distill an in-context learning algorithm from suboptimal data. The authors demonstrate emergent in-context RL on grid-world and pixel-based 3D tasks, with the in-context agent outperforming the best data policy by up to about $2\times$, and show robustness across suboptimal trajectories and varying learning pace. Overall, AD$^\varepsilon$ lowers data barriers to in-context RL and highlights learning-pace dynamics as a critical lever for generalization and adaptation in noisy learning histories.

Abstract

Recently, extensive studies in Reinforcement Learning have been carried out on the ability of transformers to adapt in-context to various environments and tasks. Current in-context RL methods are limited by their strict requirements for data, which needs to be generated by RL agents or labeled with actions from an optimal policy. In order to address this prevalent problem, we propose AD$^\varepsilon$, a new data acquisition approach that enables in-context Reinforcement Learning from noise-induced curriculum. We show that it is viable to construct a synthetic noise injection curriculum which helps to obtain learning histories. Moreover, we experimentally demonstrate that it is possible to alleviate the need for generation using optimal policies, with in-context RL still able to outperform the best suboptimal policy in a learning dataset by a 2x margin.
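
The noise-injection idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the linear schedule, the toy one-dimensional corridor, and the names `epsilon_schedule`, `generate_history`, `step_fn`, `reset_fn`, and `eps_min` are all hypothetical. The sketch only shows the general mechanism: a fixed demonstrator acts under noise that shrinks over time, so the recorded history resembles that of an agent that is gradually improving.

```python
import random


def epsilon_schedule(step, total_steps, eps_min=0.0):
    """Anneal epsilon linearly from 1.0 to eps_min (assumed schedule, not the paper's)."""
    frac = step / max(total_steps - 1, 1)
    return 1.0 - (1.0 - eps_min) * frac


def generate_history(demonstrator, step_fn, reset_fn, n_actions,
                     total_steps, eps_min=0.0, seed=0):
    """Roll out a demonstrator under decaying noise and record (state, action, reward)."""
    rng = random.Random(seed)
    state = reset_fn()
    history = []
    for t in range(total_steps):
        eps = epsilon_schedule(t, total_steps, eps_min)
        if rng.random() < eps:
            action = rng.randrange(n_actions)   # noise: uniformly random action
        else:
            action = demonstrator(state)        # follow the demonstrator
        next_state, reward, done = step_fn(state, action)
        history.append((state, action, reward))
        state = reset_fn() if done else next_state
    return history


# Hypothetical toy task: a corridor with states 0..GOAL, reward 1.0 on reaching GOAL.
GOAL = 4


def reset_fn():
    return 0


def step_fn(state, action):
    """Action 1 moves right, any other action moves left; walls clip the position."""
    nxt = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def demonstrator(state):
    """Always move right, which is the best action in this toy corridor."""
    return 1


history = generate_history(demonstrator, step_fn, reset_fn,
                           n_actions=2, total_steps=2000)
quarter = len(history) // 4
early_reward = sum(r for _, _, r in history[:quarter])
late_reward = sum(r for _, _, r in history[-quarter:])
print(early_reward, late_reward)
```

In this sketch, rewards collected late in the history exceed those collected early, which is the improvement signal a sequence model would be trained on. Setting `eps_min` above zero keeps some noise until the end, loosely mirroring the suboptimal-demonstrator setting described in the figure captions.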

Paper Structure

This paper contains 32 sections, 1 equation, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: Data acquisition for in-context RL training. While other in-context RL methods either train thousands of single-task RL algorithms to obtain their learning histories (AD) or pretrain on optimal actions (DPT), our approach AD$^\varepsilon$ alleviates these problems by introducing a synthetic noise-injection curriculum that generates the learning histories. Algorithms trained on this kind of data can not only generalize to unseen tasks, but also outperform the best policy available in the data ($\pi^\text{data}$).
  • Figure 2: The performance of AD$^\varepsilon$ on test environments. The agent must find unseen goals by memorizing visited states and rewards. The main difference from the standard approach used in in-context RL is that we generate learning histories by injecting noise, thereby eliminating the need to train thousands of single-task RL agents. Here, we demonstrate that our data collection strategy provides suitable data for training in-context RL models. The mean performance of the data-generating oracles is shown for comparison. The performance is averaged across three seeds, with shaded regions showing one std.
  • Figure 3: Examples of Watermaze input images.
  • Figure 4: The performance of AD$^\varepsilon$ pretrained on the data generated by suboptimal policies. We show that in-context agents can outperform even the best policy available in the data by a large margin, which highlights the ability of in-context RL agents to learn without pretraining on optimal actions. In these experiments, we use a generating policy that reaches 40% (for Dark Room) and 50% (for Dark Key-to-Door and Watermaze) of the optimal performance in the environment. Its mean performance is shown as a blue dashed line. We observe a substantial improvement when compared to the suboptimal policy: +120% (6.36 → 14.08) in Dark Room, +74% (1.0 → 1.74) in Dark Key-to-Door, +76% (0.52 → 0.92) in Watermaze. For further comparison, we show Behavior Cloning, which is unable to generalize to unseen tasks. The mean performance of the best policy available in the data is shown in light blue. The performance is averaged across three seeds, with shaded regions showing one std.
  • Figure 5: The performance of AD$^\varepsilon$ for different suboptimal generating policies in the Key-to-Door environment, which has two subgoals: find a key, then open a door using the key. To test the limits of in-context agents, we generated three datasets, each representing a different level of performance relative to the maximum reward. This was made possible by scheduling $\varepsilon$ to a non-zero number, so that the mean performance of generating policies is bounded by $\textrm{max\_perf}$. As one can observe in (b), the in-context agent manages to outperform even a significantly suboptimal policy used for data generation. However, it is important to point out that, in order for the in-context agent to successfully learn and achieve both subgoals (locating the key and the door), the data must contain sufficient examples of both. In the case of (b), the mean reward of the data-generating trajectories is 0.6, indicating that the agent generating the data rarely encounters a door. As a result, the in-context learner also struggles to learn the task of finding the door. Similarly, in (c), the in-context agent fails to learn effectively due to the same lack of diverse examples in the training data. We plot the mean returns of generating policies as a greenish-gray dashed line. The AD$^\varepsilon$ performance is averaged across three seeds $\pm$ 1 std.
  • ...and 5 more figures