XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning

Alexander Nikulin, Ilya Zisman, Alexey Zemtsov, Vladislav Kurenkov

TL;DR

This work addresses the need for scalable, diverse benchmarks to study in-context reinforcement learning (ICRL). It introduces XLand-100B, a large-scale dataset built on XLand-MiniGrid, containing roughly 30k tasks, 100B transitions, and 2.5B episodes, collected via a multi-stage PPO-based pipeline and complemented by a smaller XLand-Trivial-20B for rapid experimentation. The authors evaluate two ICRL paradigms, Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT), finding that AD exhibits emergent in-context adaptation on simpler tasks and benefits from larger, more diverse data, while DPT struggles due to partial observability. The dataset and tools aim to democratize ICRL research and lay a foundation for scaling ICRL toward more generalist agents, though challenges remain for complex rule sets and POMDP-like environments. Overall, XLand-100B provides a substantial, public resource for advancing understanding of learning-to-learn behaviors in RL from rich learning histories.

Abstract

Following the success of the in-context learning paradigm in large-scale language and computer vision models, the recently emerging field of in-context reinforcement learning is experiencing rapid growth. However, its development has been held back by the lack of challenging benchmarks, as all experiments so far have been carried out in simple environments and on small-scale datasets. We present XLand-100B, a large-scale dataset for in-context reinforcement learning based on the XLand-MiniGrid environment, as a first step to alleviate this problem. It contains complete learning histories for nearly 30,000 different tasks, covering 100B transitions and 2.5B episodes. It took 50,000 GPU hours to collect the dataset, which is beyond the reach of most academic labs. Along with the dataset, we provide the utilities to reproduce or expand it even further. We also benchmark common in-context RL baselines and show that they struggle to generalize to novel and diverse tasks. With this substantial effort, we aim to democratize research in the rapidly growing field of in-context reinforcement learning and provide a solid foundation for further scaling.
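
To put the headline numbers in perspective, the following back-of-the-envelope sketch derives a few averages from the figures quoted in the abstract. The inputs are the rounded values stated there ("nearly 30,000" tasks is treated as exactly 30,000), so the results are only rough estimates, not statistics reported by the authors.

```python
# Rounded figures quoted in the abstract (assumption: "nearly 30,000" ~ 30,000).
NUM_TASKS = 30_000
NUM_TRANSITIONS = 100_000_000_000  # 100B
NUM_EPISODES = 2_500_000_000       # 2.5B
GPU_HOURS = 50_000

# Simple averages; real per-task values may vary considerably.
transitions_per_episode = NUM_TRANSITIONS / NUM_EPISODES
transitions_per_task = NUM_TRANSITIONS / NUM_TASKS
episodes_per_task = NUM_EPISODES / NUM_TASKS
gpu_hours_per_task = GPU_HOURS / NUM_TASKS

print(f"avg transitions per episode: {transitions_per_episode:.0f}")
print(f"avg transitions per task:    {transitions_per_task:,.0f}")
print(f"avg episodes per task:       {episodes_per_task:,.0f}")
print(f"avg GPU hours per task:      {gpu_hours_per_task:.2f}")
```

On these rounded inputs, an average episode is about 40 transitions long and each task contributes on the order of a few million transitions.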

Paper Structure (26 sections, 24 figures, 10 tables)

Figures (24)

  • Figure 1: Visualization of a generic XLand-MiniGrid environment. The grid layout must be selected in advance, while the positions of the objects are randomized on each reset. For the dataset we use a simpler layout with just one room, see Appendix \ref{apndx:rooms-viz}.
  • Figure 2: Evaluation return for multi-task goal-conditioned recurrent PPO pretraining on 65k tasks. The pretrained agent was then used as a starting point for single-task fine-tuning during dataset collection.
  • Figure 3: Single-task evaluation curves on 36 hard tasks for policies trained from scratch or fine-tuned from multi-task pre-trained checkpoints. See Appendix \ref{app:collection} for curves on tasks of all difficulty levels.
  • Figure 4: Distribution of tasks by difficulty, as initially sampled and in the resulting dataset. To ensure quality, we filtered out tasks where the final return was below $0.3$ or the data was corrupted by errors during training.
  • Figure 5: Learning histories for the XLand-100B dataset separated by number of rules. For visual clarity, we show only a sample of the possible number of rules and normalize the number of episodes, as they may vary considerably.
  • ...and 19 more figures
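
The quality filter described in the Figure 4 caption (drop tasks whose final return is below 0.3 or whose data was corrupted) can be sketched as follows. This is a minimal illustration only: the `TaskHistory` record, its field names, and the example values are hypothetical and are not taken from the released dataset or tooling. Only the 0.3 threshold comes from the caption.

```python
from dataclasses import dataclass

MIN_FINAL_RETURN = 0.3  # threshold stated in the Figure 4 caption


@dataclass
class TaskHistory:
    """Hypothetical per-task summary; not the dataset's actual schema."""
    task_id: int
    final_return: float  # evaluation return at the end of training
    corrupted: bool      # True if collection for this task hit an error


def keep_task(history: TaskHistory) -> bool:
    """Keep a task only if its data is intact and the final return is at least 0.3."""
    return (not history.corrupted) and history.final_return >= MIN_FINAL_RETURN


# Made-up example records, purely for illustration.
histories = [
    TaskHistory(task_id=0, final_return=0.85, corrupted=False),
    TaskHistory(task_id=1, final_return=0.12, corrupted=False),  # return too low
    TaskHistory(task_id=2, final_return=0.90, corrupted=True),   # corrupted data
    TaskHistory(task_id=3, final_return=0.30, corrupted=False),  # exactly at threshold
]

kept_ids = [h.task_id for h in histories if keep_task(h)]
print(kept_ids)
```

A filter of this kind shifts the difficulty distribution of the final dataset relative to the initially sampled tasks, since harder tasks are more likely to end with a low final return; this is the comparison Figure 4 shows.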