XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning
Alexander Nikulin, Ilya Zisman, Alexey Zemtsov, Vladislav Kurenkov
TL;DR
This work addresses the need for scalable, diverse benchmarks to study in-context reinforcement learning (ICRL). It introduces XLand-100B, a large-scale dataset built on XLand-MiniGrid, containing roughly 30k tasks, 100B transitions, and 2.5B episodes, collected via a multi-stage PPO-based pipeline and complemented by a smaller XLand-Trivial-20B for rapid experimentation. The authors evaluate two ICRL paradigms, Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT), finding that AD exhibits emergent in-context adaptation on simpler tasks and benefits from larger, more diverse data, while DPT struggles due to partial observability. The dataset and tools aim to democratize ICRL research and lay a foundation for scaling ICRL toward more generalist agents, though challenges remain for complex rule sets and POMDP-like environments. Overall, XLand-100B provides a substantial, public resource for advancing understanding of learning-to-learn behaviors in RL from rich learning histories.
Abstract
Following the success of the in-context learning paradigm in large-scale language and computer vision models, the recently emerging field of in-context reinforcement learning is experiencing rapid growth. However, its development has been held back by the lack of challenging benchmarks, as all the experiments have been carried out in simple environments and on small-scale datasets. We present XLand-100B, a large-scale dataset for in-context reinforcement learning based on the XLand-MiniGrid environment, as a first step toward alleviating this problem. It contains complete learning histories for nearly 30,000 different tasks, covering 100B transitions and 2.5B episodes. It took 50,000 GPU hours to collect the dataset, which is beyond the reach of most academic labs. Along with the dataset, we provide the utilities to reproduce or expand it even further. We also benchmark common in-context RL baselines and show that they struggle to generalize to novel and diverse tasks. With this substantial effort, we aim to democratize research in the rapidly growing field of in-context reinforcement learning and provide a solid foundation for further scaling.
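To give a rough sense of scale, the headline figures above can be turned into per-task and per-episode averages. The sketch below is plain arithmetic on the rounded numbers quoted in the abstract (treating "nearly 30,000" as exactly 30,000). The function and variable names are illustrative and are not part of the released utilities, and the actual per-task distribution may be far from uniform.

```python
# Rounded headline figures from the abstract; "nearly 30,000" tasks is
# approximated as exactly 30,000, so every result is a rough average only.
NUM_TASKS = 30_000
NUM_TRANSITIONS = 100_000_000_000  # 100B
NUM_EPISODES = 2_500_000_000       # 2.5B
GPU_HOURS = 50_000


def dataset_averages(tasks, transitions, episodes, gpu_hours):
    """Return back-of-the-envelope averages implied by the totals."""
    return {
        "transitions_per_episode": transitions / episodes,
        "transitions_per_task": transitions / tasks,
        "episodes_per_task": episodes / tasks,
        "gpu_hours_per_task": gpu_hours / tasks,
    }


averages = dataset_averages(NUM_TASKS, NUM_TRANSITIONS, NUM_EPISODES, GPU_HOURS)
for name, value in averages.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions, an average episode is about 40 transitions long, and each task contributes on the order of 3.3 million transitions, about 83 thousand episodes, and under two GPU hours of collection time.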
