Efficient Unsupervised Environment Design through Hierarchical Policy Representation Learning
Dexun Li, Sidney Tio, Pradeep Varakantham
TL;DR
This paper tackles unsupervised environment design (UED) under tight resource constraints by introducing SHED, a hierarchical MDP framework in which a teacher uses a compact representation of the student’s policy to generate curricula. Teacher training is further augmented with a diffusion-based world model that synthesizes student-policy trajectories, enabling off-policy learning and reducing the need for expensive real interactions. SHED combines an upper-level teacher MDP and a lower-level student MDP with evaluation-environment discretization and a calibrated teacher reward; the resulting curricula produce robust zero-shot transfer across the Lunar Lander, Bipedal Walker, and Maze domains while using substantially fewer interactions. The approach requires an initial investment of approximately 2500 environments over 50 episodes, after which the trained teacher can train a new student with only 50 environments, offering practical benefits for resource-limited settings and sim-to-real transfer.
Abstract
Unsupervised Environment Design (UED) has emerged as a promising approach to developing general-purpose agents through automated curriculum generation. Popular UED methods focus on open-endedness, in which teacher algorithms rely on stochastic processes to generate an unbounded stream of useful environments. This reliance on unlimited generation becomes impractical in resource-constrained scenarios where opportunities for teacher-student interaction are limited. To address this challenge, we introduce a hierarchical Markov Decision Process (MDP) framework for environment design. Our framework features a teacher agent that leverages student policy representations derived from discovered evaluation environments, enabling it to generate training environments tailored to the student's capabilities. To improve efficiency, we incorporate a generative model that augments the teacher's training dataset with synthetic data, reducing the number of teacher-student interactions required. In experiments across several domains, we show that our method outperforms baseline approaches while requiring fewer teacher-student interactions within a single episode. The results suggest that our approach is applicable in settings where training opportunities are limited.
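The teacher-student loop described above can be illustrated with a toy sketch. It assumes the student policy representation is simply the student's score on each of a fixed set of evaluation environments, and it substitutes a hand-written "train on the weakest environment" rule for the learned teacher policy. The function names, the integer skill model, and the rule itself are hypothetical and are not taken from the paper; only the budget of 50 environments echoes the TL;DR.

```python
from typing import List, Tuple


def policy_representation(student_skill: List[int]) -> List[int]:
    """Hypothetical representation: the student's score on each evaluation environment."""
    return list(student_skill)


def teacher_action(representation: List[int]) -> int:
    """Stand-in teacher (not the learned teacher from the paper):
    pick the evaluation environment with the lowest score, first index on ties."""
    return min(range(len(representation)), key=lambda i: representation[i])


def train_student(student_skill: List[int], env_index: int) -> None:
    """Toy student update: training on an environment raises its score by one."""
    student_skill[env_index] += 1


def run_curriculum(num_eval_envs: int = 5, budget: int = 50) -> Tuple[List[int], List[int]]:
    """Upper level: the teacher observes the representation and picks an environment.
    Lower level: the student trains on it. Repeat until the budget is spent."""
    skill = [0] * num_eval_envs
    history: List[int] = []
    for _ in range(budget):
        representation = policy_representation(skill)
        env_index = teacher_action(representation)
        train_student(skill, env_index)
        history.append(env_index)
    return policy_representation(skill), history


if __name__ == "__main__":
    final_representation, chosen_envs = run_curriculum()
    print(final_representation)
    print(chosen_envs[:10])
```

In this sketch the teacher conditions only on the representation vector, never on the student's parameters, which is the property that lets a single teacher be reused across students. A learned teacher would replace `teacher_action` with a policy trained on real and synthetic transitions.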
