Human-Timescale Adaptation in an Open-Ended Task Space

Adaptive Agent Team, Jakob Bauer, Kate Baumli, Satinder Baveja, Feryal Behbahani, Avishkar Bhoopchand, Nathalie Bradley-Schmieg, Michael Chang, Natalie Clay, Adrian Collister, Vibhavari Dasagi, Lucy Gonzalez, Karol Gregor, Edward Hughes, Sheleem Kashem, Maria Loks-Thompson, Hannah Openshaw, Jack Parker-Holder, Shreya Pathak, Nicolas Perez-Nieves, Nemanja Rakicevic, Tim Rocktäschel, Yannick Schroecker, Jakub Sygnowski, Karl Tuyls, Sarah York, Alexander Zacherl, Lei Zhang

TL;DR

AdA demonstrates rapid, human-timescale adaptation in a vast open-ended RL domain by training a memory-augmented Transformer with meta-RL and automatic curricula. The approach combines XLand 2.0's production-rule dynamics, PLR/No-op auto-curricula, and distillation to scale to hundreds of millions of parameters, enabling few-shot adaptation and first-person prompting. Empirical results reveal scaling laws for model size and memory, strong adaptation across single- and multi-agent tasks, and emergent cooperative behaviors, underscoring the potential of RL foundation-model–style agents for ever-larger open-ended domains. This work provides a concrete recipe for training general, adaptive RL agents capable of rapid in-context learning and task-driven behavior refinement at test time.

Abstract

Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains.
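The third ingredient, a curriculum that prioritises tasks at the frontier of the agent's capabilities, can be illustrated with a small filter over sampled tasks. This is a minimal sketch under stated assumptions, not the paper's procedure: the `TaskStats` record, the `at_frontier` and `filter_training_pool` functions, the comparison against a do-nothing baseline, and the `low`/`high` thresholds are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskStats:
    """Evaluation statistics for one candidate task (hypothetical record)."""
    task_id: str
    agent_score: float  # mean normalised return of the current agent, in [0, 1]
    noop_score: float   # mean normalised return of a do-nothing baseline


def at_frontier(stats: TaskStats, low: float = 0.1, high: float = 0.9) -> bool:
    """Keep a task when it is neither trivial nor currently out of reach.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    solvable_without_acting = stats.noop_score >= high
    return (not solvable_without_acting) and low <= stats.agent_score <= high


def filter_training_pool(candidates: List[TaskStats]) -> List[str]:
    """Return the ids of sampled tasks that pass the frontier check."""
    return [s.task_id for s in candidates if at_frontier(s)]
```

Re-running such a filter as the agent improves would yield a training pool that keeps shifting towards harder tasks, which is the behaviour the abstract attributes to the automated curriculum.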

Paper Structure (86 sections, 2 equations, 40 figures, 20 tables)

This paper contains 86 sections, 2 equations, 40 figures, 20 tables.

Figures (40)

  • Figure 1: Human timescale adaptation. Example trajectories of our agent (AdA) solving a held-out task in a complex 3D environment within minutes of test-time experience without any further agent training. Initial trials (Exploration) show a policy that uncovers hidden environment dynamics. After just seconds of test-time experience (Success), AdA finds a valid solution to the task. Later (Refinement), it improves this solution, gradually finding a more rewarding behaviour. The solid white lines show agent movement. The dashed coloured lines show the agent carrying an object of the corresponding colour. For a full description of the task, see Figure \ref{fig:wrong-pair-disappears-explained}. Videos of AdA's behaviour are available on our website (http://sites.google.com/view/adaptive-agent/) and in the accompanying video (https://youtu.be/U93bUQ1roiw).
  • Figure 2: Training our Adaptive Agent (AdA). We train a large Transformer model with meta-RL in XLand. During training, tasks are uniformly sampled, and subsequently filtered to produce an ever-changing training pool of tasks at the frontier of the agent's capabilities. After training on these tasks, the agent is capable of adapting to unseen hand-authored tasks as effectively and efficiently as humans.
  • Figure 3: XLand 2.0: a vast, smooth and diverse task space of adaptation problems. Different tasks have different adaptation requirements, such as experimentation, tool use or division of labour. For instance, in a task requiring experimentation, a player might be required to identify which objects can usefully combine, avoiding dead-ends, and then optimise the way in which they combine objects, like a toy version of experimental chemistry. Each task can be run for one or more trials, where the environment is reset between trials, but agent memory is not. Highlighted are two example tasks, Wrong Pair Disappears and Pass Over Wall Repeatedly, showing the goal, initial objects, production rules ("rules" in the figure) and how agents need to interact with them to solve the task. For full task descriptions see Appendix \ref{appendix:probe_tasks}.
  • Figure 4: Agent architecture. For each timestep, we embed and combine the pixel observation, goal, hand, trial and time information, production rules, previous action, and previous reward into a single vector. These observation embeddings pass in sequence to the Transformer-XL, whose output embeddings feed into an MLP value head, MLP policy head, and the Muesli LSTM model step (omitted in the diagram for brevity). See Appendix \ref{app:agent-architecture} for more details about our agent architecture.
  • Figure 5: Zero-shot generalisation and few-shot adaptation. We report the distribution of normalised task scores over the single-agent test set when evaluated with various numbers of trials. On the $y$-axis is the total last-trial reward relative to that of an agent fine-tuned on the test tasks (approximating "infinite trials" performance). Curves moving further towards the top right corner indicate better performance. When given more trials, the agent achieves higher scores in the last trial, showing test-time adaptation across most of the task distribution (shaded regions). The dashed line indicates the zero-shot performance of an agent trained in a regime where every episode consists of only a single trial.
  • ...and 35 more figures
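
The normalised score described in the Figure 5 caption (last-trial reward relative to that of an agent fine-tuned on the test tasks) can be sketched as follows. This is an illustrative reading of the caption rather than the authors' evaluation code: the function names, the dictionary layout, the handling of tasks whose reference reward is not positive, and the choice to summarise the distribution as "fraction of tasks reaching a threshold" are all assumptions.

```python
from typing import Dict, List


def normalised_scores(last_trial_reward: Dict[str, float],
                      finetuned_reward: Dict[str, float]) -> Dict[str, float]:
    """Per task: last-trial reward divided by the fine-tuned agent's reward."""
    return {
        task: last_trial_reward[task] / finetuned_reward[task]
        for task in last_trial_reward
        if finetuned_reward[task] > 0  # assumption: skip tasks with no usable reference
    }


def fraction_at_least(scores: Dict[str, float],
                      thresholds: List[float]) -> List[float]:
    """For each threshold, the fraction of tasks whose score reaches it."""
    values = list(scores.values())
    return [sum(v >= t for v in values) / len(values) for t in thresholds]
```

Computing `fraction_at_least` once per trial budget would give one curve per number of trials, so that a curve lying further towards the top right corresponds to more tasks reaching higher normalised scores.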