XLand-MiniGrid: Scalable Meta-Reinforcement Learning Environments in JAX

Alexander Nikulin, Vladislav Kurenkov, Ilya Zisman, Artem Agarkov, Viacheslav Sinii, Sergey Kolesnikov

TL;DR

XLand-MiniGrid is a suite of tools and grid-world environments for meta-reinforcement learning research. It is designed to be highly scalable and can potentially run on GPU or TPU accelerators, democratizing large-scale experimentation with limited resources.

Abstract

Inspired by the diversity and depth of XLand and the simplicity and minimalism of MiniGrid, we present XLand-MiniGrid, a suite of tools and grid-world environments for meta-reinforcement learning research. Written in JAX, XLand-MiniGrid is designed to be highly scalable and can potentially run on GPU or TPU accelerators, democratizing large-scale experimentation with limited resources. Along with the environments, XLand-MiniGrid provides pre-sampled benchmarks with millions of unique tasks of varying difficulty and easy-to-use baselines that allow users to quickly start training adaptive agents. In addition, we have conducted a preliminary analysis of scaling and generalization, showing that our baselines are capable of reaching millions of steps per second during training and validating that the proposed benchmarks are challenging. XLand-MiniGrid is open-source and available at https://github.com/dunnolab/xland-minigrid.

Paper Structure (23 sections, 19 figures, 7 tables)

This paper contains 23 sections, 19 figures, 7 tables.

Figures (19)

  • Figure 1: Visualization of how the production rules in XLand-MiniGrid work, exemplified by a few steps in the environment. In the first steps, the agent picks up the blue pyramid and places it next to the purple square. The NEAR production rule is then triggered, which transforms both objects into a red circle. See \ref{fig:ruleset-demo} and \ref{rules-and-goals} for additional details.
  • Figure 2: Visualization of a specific sampled task (see \ref{fig:task-tree}) in XLand-MiniGrid. We highlighted the optimal path to solve this particular task. The agent needs to take the blue pyramid and put it near the purple square in order to transform both objects into a red circle. To complete the goal, a red circle needs to be placed near the green circle. However, placing the purple square near the yellow circle will make the task unsolvable in this trial. Initial positions of objects are randomized on each reset. Rules and goals are hidden from the agent.
  • Figure 3: Basic example usage of XLand-MiniGrid.
  • Figure 4: Visualization of a specific task tree with depth two, sampled according to the procedure described in \ref{benchmarks}. The root of the tree is a goal to be achieved by the agent, while all other nodes are production rules describing possible transformations. At the beginning of each episode, only the input objects of the leaf production rules are placed on the grid. In addition to the main task tree, the distractor production rules can be sampled. They contain already used objects to introduce dead ends. All of this together is what we call a ruleset, as it defines the task.
  • Figure 5: Distribution of the number of rules for the available benchmark configurations. Each successive benchmark offers an increasingly diverse distribution of tasks, while still including tasks from the previous benchmarks. The average task complexity, as well as tree depth, also increases. See \ref{benchmarks} for the generation procedure and \ref{app:benchmarks} for the exact generation configuration. Users can also easily generate and load custom benchmarks, even with a custom generation procedure, as long as the final format is the same.
  • ...and 14 more figures
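
The interaction described in the captions of Figures 1 and 2 (a NEAR production rule that turns two adjacent objects into a new one, followed by a NEAR goal) can be illustrated with a small sketch. This is plain Python written for illustration only, not the XLand-MiniGrid API, which is written in JAX. The names `NearRule`, `NearGoal`, `apply_near_rule` and `goal_achieved` are hypothetical, and two details are assumptions of the sketch rather than facts from the paper: adjacency is taken to mean the four-cell neighbourhood, and the product is placed on the cell of the first input object.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Obj = Tuple[str, str]   # (shape, color), e.g. ("pyramid", "blue")
Pos = Tuple[int, int]   # (row, col)
Grid = Dict[Pos, Obj]   # sparse grid: only occupied cells are stored


@dataclass(frozen=True)
class NearRule:
    """Hypothetical production rule: `a` next to `b` turns both into `product`."""
    a: Obj
    b: Obj
    product: Obj


@dataclass(frozen=True)
class NearGoal:
    """Hypothetical goal: satisfied once `a` is next to `b`."""
    a: Obj
    b: Obj


def adjacent(p: Pos, q: Pos) -> bool:
    # Assumption: "near" means the 4-neighbourhood (up, down, left, right).
    return abs(p[0] - q[0]) + abs(p[1] - q[1]) == 1


def find_adjacent_pair(grid: Grid, a: Obj, b: Obj) -> Optional[Tuple[Pos, Pos]]:
    # Return positions of the first adjacent (a, b) pair, or None if there is none.
    for pos_a, obj_a in grid.items():
        if obj_a != a:
            continue
        for pos_b, obj_b in grid.items():
            if obj_b == b and adjacent(pos_a, pos_b):
                return pos_a, pos_b
    return None


def apply_near_rule(grid: Grid, rule: NearRule) -> Grid:
    # Leave the grid unchanged unless the rule's inputs are adjacent.
    pair = find_adjacent_pair(grid, rule.a, rule.b)
    if pair is None:
        return grid
    pos_a, pos_b = pair
    new_grid = {pos: obj for pos, obj in grid.items() if pos not in (pos_a, pos_b)}
    # Assumption: the product appears where the first input object stood.
    new_grid[pos_a] = rule.product
    return new_grid


def goal_achieved(grid: Grid, goal: NearGoal) -> bool:
    return find_adjacent_pair(grid, goal.a, goal.b) is not None


BLUE_PYRAMID = ("pyramid", "blue")
PURPLE_SQUARE = ("square", "purple")
RED_CIRCLE = ("circle", "red")
GREEN_CIRCLE = ("circle", "green")

rule = NearRule(BLUE_PYRAMID, PURPLE_SQUARE, RED_CIRCLE)
goal = NearGoal(RED_CIRCLE, GREEN_CIRCLE)

# The pyramid has already been placed next to the square, so the rule fires.
grid = {(1, 1): BLUE_PYRAMID, (1, 2): PURPLE_SQUARE, (2, 1): GREEN_CIRCLE}
grid = apply_near_rule(grid, rule)
print(grid, goal_achieved(grid, goal))
```

In this toy layout the resulting red circle happens to land next to the green circle, so the goal check succeeds right away. In the task from Figure 2 the agent would still have to carry the red circle to the green one, and it has to discover the rules and the goal through interaction, since both are hidden from it.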