CLUTR: Curriculum Learning via Unsupervised Task Representation Learning

Abdus Salam Azad; Izzeddin Gur; Jasper Emhoff; Nathaniel Alexis; Aleksandra Faust; Pieter Abbeel; Ion Stoica

TL;DR

CLUTR addresses the sample inefficiency and non-stationarity of Unsupervised Environment Design (UED) by decoupling latent task representation learning from curriculum optimization. It first pretrains a recurrent VAE on randomly generated task sequences to learn a latent task manifold, then keeps that latent space fixed while a teacher, trained with a minimax regret-based objective, selects latent tasks to form a curriculum. This yields faster and more stable zero-shot generalization. Empirically, CLUTR outperforms PAIRED in CarRacing (a 10.6X improvement in zero-shot generalization) and in MiniGrid navigation (a 45% higher solve rate), and it performs comparably to the non-UED CarRacing state of the art while requiring 500X fewer environment interactions. Its ablations indicate that two-stage optimization, together with sorting the VAE training data, outperforms jointly learning the task manifold and the curriculum. The work highlights the benefits of explicit latent-task modeling for UED and offers a practical route to robust, sample-efficient curriculum design in complex, partially observable domains, with potential extensions to more realistic task generation and broader environments.

Abstract

Reinforcement Learning (RL) algorithms are often known for sample inefficiency and difficult generalization. Recently, Unsupervised Environment Design (UED) emerged as a new paradigm for zero-shot generalization by simultaneously learning a task distribution and agent policies on the generated tasks. This is a non-stationary process in which the task distribution evolves along with the agent policies, creating instability over time. While past works demonstrated the potential of such approaches, sampling effectively from the task space remains an open challenge that bottlenecks them. To this end, we introduce CLUTR: a novel unsupervised curriculum learning algorithm that decouples task representation and curriculum learning into a two-stage optimization. It first trains a recurrent variational autoencoder on randomly generated tasks to learn a latent task manifold. Next, a teacher agent creates a curriculum by maximizing a minimax regret-based objective on a set of latent tasks sampled from this manifold. Using the fixed, pretrained task manifold, we show that CLUTR successfully overcomes the non-stationarity problem and improves stability. Our experimental results show that CLUTR outperforms PAIRED, a principled and popular UED method, in the challenging CarRacing and navigation environments, achieving 10.6X and 45% improvements in zero-shot generalization, respectively. CLUTR also performs comparably to the non-UED state-of-the-art for CarRacing, while requiring 500X fewer environment interactions.
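
The second stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a PAIRED-style regret estimate (the antagonist's best episode return minus the protagonist's mean episode return), and it replaces the learned teacher with a greedy pick over a set of candidate latents. The names `regret_estimate`, `select_latent`, `decode`, `rollout_protagonist`, and `rollout_antagonist` are hypothetical; `decode` stands in for the frozen, pretrained VAE decoder.

```python
from statistics import mean
from typing import Any, Callable, Sequence, Tuple

Latent = Tuple[float, ...]


def regret_estimate(protagonist_returns: Sequence[float],
                    antagonist_returns: Sequence[float]) -> float:
    """Assumed PAIRED-style estimate: best antagonist return minus mean protagonist return."""
    return max(antagonist_returns) - mean(protagonist_returns)


def select_latent(latents: Sequence[Latent],
                  decode: Callable[[Latent], Any],
                  rollout_protagonist: Callable[[Any], Sequence[float]],
                  rollout_antagonist: Callable[[Any], Sequence[float]],
                  ) -> Tuple[Latent, float]:
    """Greedy stand-in for the teacher: return the candidate latent with the highest regret.

    The decoder is never updated here, mirroring the fixed task manifold.
    """
    best_latent, best_regret = None, float("-inf")
    for z in latents:
        task = decode(z)  # map the latent to a concrete task
        r = regret_estimate(rollout_protagonist(task), rollout_antagonist(task))
        if r > best_regret:
            best_latent, best_regret = z, r
    return best_latent, best_regret


if __name__ == "__main__":
    # Toy stand-ins (illustrative only): a "task" is a single difficulty number.
    candidates = [(0.0,), (1.0,), (2.0,)]
    toy_decode = lambda z: z[0]
    toy_protagonist = lambda task: [1.0, 1.0]        # flat performance
    toy_antagonist = lambda task: [task, 2 * task]   # improves with difficulty
    print(select_latent(candidates, toy_decode, toy_protagonist, toy_antagonist))
```

In this toy setup the greedy pick favors the task where the antagonist most outperforms the protagonist. A learned teacher would instead receive the regret as its training signal while the latent space stays unchanged.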

Paper Structure (39 sections, 4 equations, 34 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Background
Unsupervised Environment Design (UED)
PAIRED
Curriculum Learning via Unsupervised Task Representation Learning
Formulation of CLUTR
Unsupervised Latent Task Representation Learning
CLUTR
CLUTR in the context of contemporary UED method landscape
Experiments
CLUTR Performance on Pixel-Based Continuous Control CarRacing Environment
CLUTR Performance on Partially Observable Navigation Tasks on MiniGrid
Learning task manifold and curriculum: Joint vs Two-staged Optimization
Impact of sorting VAE data on solving Combinatorial Explosion
...and 24 more sections

Figures (34)

Figure 1: Hierarchical Graphical Model for CLUTR
Figure 2: Comparison on the F1 benchmark, comprising 20 tracks modeled on real-life F1 racing tracks, with results collected from 10 independent runs. CLUTR achieves 10.6X and 82% higher returns than PAIRED with standard and flexible regret objectives, respectively. CLUTR also performs comparably to the attention-based non-UED CarRacing SOTA, while requiring 500X fewer environment interactions.
Figure 3: Zero-shot generalization over the course of training, measured by periodic evaluation on a subset of three F1 tracks: Singapore, Germany, and Italy. CLUTR shows significantly better sample efficiency than PAIRED.
Figure 4: Mean solve rate on the test dataset comprising 16 novel navigation tasks, from 5 independent runs. CLUTR achieves 45% and 35% higher solve rates than PAIRED with standard and flexible regret objectives, respectively.
Figure 5: Agent solve rate during training on the 16 unseen grids from PAIRED. CLUTR shows better sample efficiency and generalization than PAIRED. Results are averaged over 5 independent runs.
...and 29 more figures
