Beyond Fixed Tasks: Unsupervised Environment Design for Task-Level Pairs

Daniel Furelos-Blanco, Charles Pert, Frederik Kelbel, Alex F. Spies, Alessandra Russo, Michael Dennis

TL;DR

ATLAS (Aligning Tasks and Levels for Autocurricula of Specifications) addresses the challenge of training general agents to follow complex instructions in varied environments by co-designing tasks and levels through unsupervised environment design (UED). Tasks are expressed as reward machines (RMs) and levels as Minigrid environments. The policy is conditioned on RM graphs via a graph neural network and on observations via a CNN, and is trained with PPO. The core contributions are (i) an extension of UED that jointly co-evolves tasks and levels to produce autocurricula of solvable yet challenging problems, (ii) a mutation-driven ACCEL variant that mutates both the RM structure and the level, and (iii) an evaluation suite on which ATLAS substantially outperforms random sampling, especially when solvable problems are rare, together with ablations showing the value of joint task-level mutations. The results indicate robust generalization and faster convergence, including strong performance from ACCEL-0, which derives complexity purely through mutations of simple starting problems.

Abstract

Training general agents to follow complex instructions (tasks) in intricate environments (levels) remains a core challenge in reinforcement learning. Random sampling of task-level pairs often produces unsolvable combinations, highlighting the need to co-design tasks and levels. While unsupervised environment design (UED) has proven effective at automatically designing level curricula, prior work has only considered a fixed task. We present ATLAS (Aligning Tasks and Levels for Autocurricula of Specifications), a novel method that generates joint autocurricula over tasks and levels. Our approach builds upon UED to automatically produce solvable yet challenging task-level pairs for policy training. To evaluate ATLAS and drive progress in the field, we introduce an evaluation suite that models tasks as reward machines in Minigrid levels. Experiments demonstrate that ATLAS vastly outperforms random sampling approaches, particularly when sampling solvable pairs is unlikely. We further show that mutations leveraging the structure of both tasks and levels accelerate convergence to performant policies.
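To make the notion of a reward-machine task concrete, the following is a minimal Python sketch of an RM for the task "go to a ball, then go to a red square". The class, the state names (`u0`, `u1`, `u_acc`), the proposition strings (`front_ball`, `front_square_red`), and the reward of 1 on completion are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Hypothetical minimal reward machine: a finite-state machine whose
    transitions fire on propositions observed in the level."""
    initial: str
    accepting: str
    # transitions[state][proposition] = (next_state, reward)
    transitions: dict = field(default_factory=dict)

    def step(self, state, true_props):
        """Advance one step given the set of propositions that currently hold."""
        for prop, (next_state, reward) in self.transitions.get(state, {}).items():
            if prop in true_props:
                return next_state, reward
        return state, 0.0  # no matching edge: stay in place, no reward

    def run(self, labels):
        """Feed a sequence of proposition sets; stop once the task is complete."""
        state, total = self.initial, 0.0
        for props in labels:
            state, reward = self.step(state, props)
            total += reward
            if state == self.accepting:
                break
        return state, total


# "Go to a ball, then go to a red square" (proposition names are assumed).
rm = RewardMachine(
    initial="u0",
    accepting="u_acc",
    transitions={
        "u0": {"front_ball": ("u1", 0.0)},
        "u1": {"front_square_red": ("u_acc", 1.0)},
    },
)

final_state, total_reward = rm.run([set(), {"front_ball"}, {"front_square_red"}])
```

Reaching the red square before the ball leaves the machine in `u0`, which is how the RM enforces the ordering. A task-level pair is unsolvable when, for example, the level contains no object that can make a required proposition true.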

Paper Structure

This paper contains 80 sections, 6 equations, 28 figures, 4 tables.

Figures (28)

  • Figure 1: A problem consisting of a Minigrid level and an RM task for "go to a ball, then go to a red square".
  • Figure 2: Overview of ATLAS (Aligning Tasks and Levels for Autocurricula of Specifications) instantiated with $\textnormal{PLR}^\bot$ and ACCEL. The UED loop (left) samples problems (i.e., task-level pairs) from either a generator or a buffer of high-regret problems. ACCEL provides problems that result from mutating selected buffer problems. The policy network (right) processes observations via a convolutional neural network (CNN) and RM tasks via a graph neural network (GNN). The GNN produces representations for all RM states. The current state's embedding is concatenated with the CNN features and passed through a recurrent neural network (RNN) to capture history. The resulting representation is used to generate actions and value estimates. Policy rollouts are used to train the network (for buffer-sourced problems), and to compute regret scores, which determine if new problems enter the buffer or update existing ones. Unlike $\textnormal{PLR}^\bot$ and ACCEL, DR trains policies only from problems produced by the generator.
  • Figure 3: Performance of ATLAS variants on (a) the worst-case problems and (b) a challenging hand-designed test set.
  • Figure 4: Zero-shot performance of ATLAS on hand-designed problems. Symbols represent balls, squares, keys, and closed/locked/open/unspecified doors. Unspecified doors can match any state. Single symbols indicate $\mathtt{front}$ propositions, pairs indicate $\mathtt{next}$ propositions. Striped patterns represent an unspecified color (e.g., a striped square stands for $\mathtt{front\_square}$).
  • Figure 5: Emergent complexity metrics for problems in the buffer.
  • ...and 23 more figures
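The UED loop described in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' code: `RegretBuffer`, `ued_step`, and the callables `generate`, `rollout`, and `mutate` are hypothetical names, `rollout(problem, train)` stands in for running the policy and returning a regret estimate, and score-proportional sampling replaces whatever prioritization the paper actually uses.

```python
import random


class RegretBuffer:
    """Hypothetical fixed-size buffer of (problem, regret score) pairs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.scores = {}  # problem -> latest regret score (problems must be hashable)

    def consider(self, problem, score):
        """Update an existing problem, or insert a new one if it earns a slot."""
        if problem in self.scores or len(self.scores) < self.capacity:
            self.scores[problem] = score
            return True
        worst = min(self.scores, key=self.scores.get)
        if score > self.scores[worst]:
            del self.scores[worst]  # evict the lowest-regret problem
            self.scores[problem] = score
            return True
        return False

    def sample(self, rng):
        """Sample a problem with probability proportional to its score (assumed)."""
        problems = list(self.scores)
        weights = [max(self.scores[p], 1e-8) for p in problems]
        return rng.choices(problems, weights=weights, k=1)[0]


def ued_step(buffer, rng, generate, rollout, mutate=None, replay_prob=0.5):
    """One loop iteration: replay (and train on) a buffer problem, or score a new one."""
    if buffer.scores and rng.random() < replay_prob:
        problem = buffer.sample(rng)
        buffer.consider(problem, rollout(problem, train=True))  # train on buffer problems
        if mutate is not None:  # ACCEL-style: mutate the replayed problem, then score it
            child = mutate(problem, rng)
            buffer.consider(child, rollout(child, train=False))
        return "replay"
    problem = generate(rng)  # new problems are only evaluated, not trained on
    buffer.consider(problem, rollout(problem, train=False))
    return "generate"


# Toy usage: problems are integers and "regret" is a made-up function of them.
rng = random.Random(0)
buffer = RegretBuffer(capacity=2)
first = ued_step(
    buffer, rng,
    generate=lambda r: r.randrange(100),
    rollout=lambda problem, train: (problem % 10) / 10.0,
)
```

Passing `mutate=None` corresponds to the replay-only variant, while supplying a mutation function that edits both the task and the level gives the ACCEL-style behaviour the caption describes. A domain-randomization baseline would instead train directly on every generated problem and skip the buffer.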

Theorems & Definitions (2)

  • Example 1
  • Example 2