Efficient Unsupervised Environment Design through Hierarchical Policy Representation Learning

Dexun Li; Sidney Tio; Pradeep Varakantham

TL;DR

This paper tackles unsupervised environment design (UED) under tight resource constraints by introducing SHED, a hierarchical MDP framework in which a teacher uses a compact representation of the student’s policy to generate curricula. Teacher training is augmented with a diffusion-based world model that generates synthetic student-policy trajectories, which enables off-policy learning and reduces the need for expensive real interactions. SHED combines an upper-level teacher MDP and a lower-level student MDP with evaluation-environment discretization and a calibrated teacher reward. The resulting curricula produce robust zero-shot transfer across the Lunar Lander, Bipedal Walker, and Maze domains while using substantially fewer interactions. The approach requires an initial investment of approximately 2500 environments over 50 episodes, after which the trained teacher can train new students with only 50 environments, a practical benefit for resource-limited settings and sim-to-real transfer.

Abstract

Unsupervised Environment Design (UED) has emerged as a promising approach to developing general-purpose agents through automated curriculum generation. Popular UED methods focus on Open-Endedness, where teacher algorithms rely on stochastic processes for infinite generation of useful environments. This assumption becomes impractical in resource-constrained scenarios where teacher-student interaction opportunities are limited. To address this challenge, we introduce a hierarchical Markov Decision Process (MDP) framework for environment design. Our framework features a teacher agent that leverages student policy representations derived from discovered evaluation environments, enabling it to generate training environments based on the student's capabilities. To improve efficiency, we incorporate a generative model that augments the teacher's training dataset with synthetic data, reducing the need for teacher-student interactions. In experiments across several domains, we show that our method outperforms baseline approaches while requiring fewer teacher-student interactions in a single episode. The results suggest the applicability of our approach in settings where training opportunities are limited.
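As a rough illustration of the teacher-student interaction described above, the sketch below collects upper-level transitions (state, action, reward, next state) for a teacher under a fixed interaction budget. It is a minimal sketch under stated assumptions, not the authors' implementation: `collect_teacher_transitions`, `ToyStudent`, and `random_teacher` are hypothetical names, the student is a toy whose skill grows by a fixed step, and the reward (mean improvement across evaluation environments) is an assumed stand-in for the paper's calibrated teacher reward.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]
Transition = Tuple[Vector, Vector, float, Vector]  # (s, a, r, s')


def collect_teacher_transitions(
    evaluate: Callable[[], Vector],
    propose_env: Callable[[Vector], Vector],
    train_student: Callable[[Vector], None],
    budget: int,
) -> List[Transition]:
    """Run `budget` teacher steps and record the upper-level transitions."""
    transitions: List[Transition] = []
    state = evaluate()  # student performance on the evaluation environments
    for _ in range(budget):
        action = propose_env(state)  # environment parameters chosen by the teacher
        train_student(action)        # lower-level student training on that environment
        next_state = evaluate()
        # Assumed reward: mean improvement across evaluation environments.
        reward = sum(next_state) / len(next_state) - sum(state) / len(state)
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions


class ToyStudent:
    """Hypothetical student whose skill grows by a fixed step per training round."""

    def __init__(self) -> None:
        self.skill = 0.0

    def train(self, env_params: Vector) -> None:
        # The toy ignores env_params; a real student would train in that environment.
        self.skill = min(1.0, self.skill + 0.1)

    def evaluate(self) -> Vector:
        # Two hypothetical evaluation environments with difficulties 0.5 and 1.0.
        return [min(1.0, self.skill / d) for d in (0.5, 1.0)]


def random_teacher(state: Vector) -> Vector:
    """Placeholder teacher policy that proposes a random one-parameter environment."""
    return [random.random()]


student = ToyStudent()
transitions = collect_teacher_transitions(
    student.evaluate, random_teacher, student.train, budget=5
)
```

Transitions recorded this way are the kind of data an off-policy teacher could learn from, and the kind of data a generative model could be fit to in order to produce additional synthetic transitions.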

Paper Structure (43 sections, 2 theorems, 28 equations, 16 figures, 1 table, 1 algorithm)

This paper contains 43 sections, 2 theorems, 28 equations, 16 figures, 1 table, 1 algorithm.

Introduction
Related Work
Unsupervised Environment Design
Generative Models in UED
Preliminaries
Underspecified Partially Observable MDP
Diffusion Probabilistic Models
Approach
Overview
Hierarchical Environment Design
Upper-level Teacher MDP
Lower-level Student MDP
Diffusion-Based World Model for Efficient Teacher Training
Generate Synthetic Trajectories
Evaluation Environment Selection for Student Policy Representation
...and 28 more sections

Key Result

Theorem 4.1

There exists a finite evaluation environment set that can capture the student's general capabilities, allowing the performance vector $[p_1, \dots, p_m]$ to serve as an effective representation of student policy $\pi$.
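The performance vector $[p_1, \dots, p_m]$ in the theorem can be sketched as follows: each entry is the student's average return on one of $m$ fixed evaluation environments. This is an illustrative sketch only; `performance_vector`, `rollout_return`, and `episodes_per_env` are hypothetical names, and the toy policy (a skill level) and environments (difficulty levels) are assumptions made to keep the example runnable.

```python
from typing import Callable, List, Sequence, TypeVar

Policy = TypeVar("Policy")
Env = TypeVar("Env")


def performance_vector(
    policy: Policy,
    eval_envs: Sequence[Env],
    rollout_return: Callable[[Policy, Env], float],
    episodes_per_env: int = 3,
) -> List[float]:
    """Return [p_1, ..., p_m]: the policy's mean return on each evaluation environment."""
    vector = []
    for env in eval_envs:
        returns = [rollout_return(policy, env) for _ in range(episodes_per_env)]
        vector.append(sum(returns) / len(returns))
    return vector


def toy_return(skill: float, difficulty: float) -> float:
    """Toy rollout: the policy succeeds when its skill meets the difficulty."""
    return 1.0 if skill >= difficulty else 0.0


representation = performance_vector(0.6, [0.2, 0.5, 0.9], toy_return)
print(representation)  # [1.0, 1.0, 0.0]
```

Two students with the same vector are indistinguishable to the teacher, so the choice of evaluation environments determines how much of the student's capability the representation captures.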

Figures (16)

Figure 1: The overall framework of SHED. SHED uses the student's performance on selected evaluation environments as the teacher's state to propose the next training environment for the student.
Figure 2: Mean zero-shot transfer performance on test environments during the final teacher episode for Lunar Lander (left) and BipedalWalker (right). Shading indicates standard error across five independent runs with different random seeds.
Figure 3: Mean rewards on Maze test environments during the final teacher episode. Shading indicates standard error across five independent runs.
Figure 4: Normalized performance over 5 runs in the Maze domain. Higher IQM scores and lower optimality gaps indicate better performance.
Figure 5: The distribution of the real $s^\prime$ and the synthetic $s^\prime$ conditioned on $(s,a)$.
...and 11 more figures
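The two aggregate metrics named in the Figure 4 caption can be computed as sketched below, using their commonly used definitions: the interquartile mean (IQM) averages the middle 50% of scores, and the optimality gap is the average shortfall below a target score. Whether the paper computes them in exactly this way is an assumption; the function names, the target of 1.0 for normalized scores, and the sample scores are illustrative, not results from the paper.

```python
from typing import Sequence


def interquartile_mean(scores: Sequence[float]) -> float:
    """Mean of the scores after dropping the lowest and highest 25% (non-empty input)."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def optimality_gap(scores: Sequence[float], target: float = 1.0) -> float:
    """Average amount by which scores fall short of `target` (non-empty input)."""
    return sum(max(target - s, 0.0) for s in scores) / len(scores)


# Made-up normalized scores, for illustration only.
example_scores = [0.1, 0.2, 0.4, 0.5, 0.6, 0.7, 0.9, 1.0]
iqm = interquartile_mean(example_scores)  # averages 0.4, 0.5, 0.6, 0.7
gap = optimality_gap(example_scores)      # averages the shortfalls below 1.0
```

IQM is less sensitive to outlier runs than the plain mean, which matters when only a handful of seeds are available.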

Theorems & Definitions (2)

Theorem 4.1
Theorem B.1
