TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design

Geonwoo Cho, Jaegyun Im, Jihwan Lee, Hojun Yi, Sejin Kim, Sundong Kim

TL;DR

TRACED introduces a transition-aware regret approximation that adds a transition-prediction loss to the usual value-loss regret proxy, and pairs it with a lightweight Co-Learnability metric that quantifies cross-task transfer. Together these yield a unified Task Priority for environment design within the Unsupervised Environment Design (UED) framework, guiding both task generation and replay. Empirically, TRACED accelerates curriculum ramp-up and improves zero-shot generalization on MiniGrid and BipedalWalker, with ablations confirming the complementary roles of the Average Transition-Prediction Loss (ATPL) and Co-Learnability. The resulting curricula are more sample-efficient and scale to large, complex environments, offering a practical route to robust RL generalization.

Abstract

Generalizing deep reinforcement learning agents to unseen environments remains a significant challenge. One promising solution is Unsupervised Environment Design (UED), a co-evolutionary framework in which a teacher adaptively generates tasks with high learning potential, while a student learns a robust policy from this evolving curriculum. Existing UED methods typically measure learning potential via regret, the gap between optimal and current performance, approximated solely by value-function loss. Building on these approaches, we introduce the transition-prediction error as an additional term in our regret approximation. To capture how training on one task affects performance on others, we further propose a lightweight metric called Co-Learnability. By combining these two measures, we present Transition-aware Regret Approximation with Co-learnability for Environment Design (TRACED). Empirical evaluations show that TRACED produces curricula that improve zero-shot generalization over strong baselines across multiple benchmarks. Ablation studies confirm that the transition-prediction error drives rapid complexity ramp-up and that Co-Learnability delivers additional gains when paired with the transition-prediction error. These results demonstrate how refined regret approximation and explicit modeling of task relationships can be leveraged for sample-efficient curriculum design in UED. Project Page: https://geonwoo.me/traced/

Paper Structure

This paper contains 55 sections, 2 theorems, 22 equations, 23 figures, 15 tables, 1 algorithm.

Key Result

Lemma 1

Define the dynamics mismatch term Under Assumptions 1 and 2, for any coupling $\Gamma_{s,a}$ of $P(\cdot\mid s,a)$ and $\hat{P}(\cdot\mid s,a)$, we have

Figures (23)

  • Figure 1: Task Priority Landscape. Tasks with high difficulty and high Co‑Learnability are scheduled with the highest priority in the curriculum.
  • Figure 2: Task Difficulty Calculation Workflow. The Task Difficulty Buffer (TDB) records each task’s history of approximated regret. The agent interacts with sampled tasks to collect episode trajectories, which are stored in the Rollouts Buffer (RB). For each trajectory, we compute the Positive Value Loss (PVL) and the Average Transition‑Prediction Loss (ATPL). Their sum produces the updated task difficulty (approximated regret), which is appended to the TDB to refresh each sampled task’s stored difficulty.
  • Figure 3: Method Workflow Overview. The three panels depict: (1) Task Sampling: levels are drawn from the buffer based on their priority scores. (2) Priority Update: each level's priority is recomputed from the task priority definition (Eq. \ref{eq:task_priority}). (3) Task Mutation: the lowest‑priority levels are mutated into new variants and reinserted into the buffer.
  • Figure 4: Held‑out Evaluation Environments. (a) Example held‑out MiniGrid mazes for zero‑shot evaluation: 4 tasks are shown (see Appendix \ref{sec:appendix:implementation_details:minigrid} for all 12 task definitions). (b) Example held‑out BipedalWalker terrains for zero‑shot evaluation: 2 tasks are shown (see Appendix \ref{sec:appendix:implementation_details:bipedalwalker} for all 6 task definitions).
  • Figure 5: Zero‑Shot Transfer Performance in MiniGrid. Aggregated solved rates on held‑out MiniGrid mazes after 10k and 20k PPO updates. TRACED at 10k updates outperforms baselines at 20k updates.
  • ...and 18 more figures
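
The task difficulty computation described in the Figure 2 caption (PVL plus ATPL per trajectory) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The PVL follows the usual positive-value-loss form (the GAE advantage clipped at zero and averaged over timesteps). The squared-error form of the transition loss, the atpl_weight parameter, and all function names are assumptions introduced here for illustration.

```python
from typing import Sequence


def positive_value_loss(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> float:
    """Average over timesteps of the GAE advantage clipped at zero.

    `values` holds one more entry than `rewards`: the last entry is the
    bootstrap value of the final state (0.0 for a terminal state).
    """
    horizon = len(rewards)
    gae = 0.0
    clipped_sum = 0.0
    for t in reversed(range(horizon)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        clipped_sum += max(gae, 0.0)
    return clipped_sum / horizon


def average_transition_prediction_loss(
    predicted_next: Sequence[Sequence[float]],
    actual_next: Sequence[Sequence[float]],
) -> float:
    """Per-step mean squared error, averaged over the trajectory.

    Assumption: the source does not state the loss form; squared error
    is used here purely as a placeholder.
    """
    per_step = [
        sum((p - a) ** 2 for p, a in zip(pred, act)) / len(act)
        for pred, act in zip(predicted_next, actual_next)
    ]
    return sum(per_step) / len(per_step)


def task_difficulty(
    rewards: Sequence[float],
    values: Sequence[float],
    predicted_next: Sequence[Sequence[float]],
    actual_next: Sequence[Sequence[float]],
    atpl_weight: float = 1.0,  # hypothetical knob; 1.0 gives the plain sum
) -> float:
    """Approximated regret for one trajectory: PVL + ATPL."""
    pvl = positive_value_loss(rewards, values, gamma=1.0, lam=1.0)
    atpl = average_transition_prediction_loss(predicted_next, actual_next)
    return pvl + atpl_weight * atpl


# Toy trajectory: three steps, two-dimensional states.
difficulty = task_difficulty(
    rewards=[0.0, 0.0, 1.0],
    values=[0.5, 0.5, 0.5, 0.0],
    predicted_next=[[0.0, 0.0], [1.0, 1.0]],
    actual_next=[[0.0, 1.0], [1.0, 1.0]],
)
print(difficulty)
```

In a replay loop, a value like this would be appended to the sampled task's entry in the Task Difficulty Buffer. How it is then combined with Co-Learnability into the Task Priority is defined by the paper's task priority equation and is not reproduced here.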

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Theorem 1: Regret Approximation Improvement with ATPL
  • proof