
Refining Minimax Regret for Unsupervised Environment Design

Michael Beukman, Samuel Coward, Michael Matthews, Mattie Fellows, Minqi Jiang, Michael Dennis, Jakob Foerster

TL;DR

This work identifies a fundamental shortcoming of minimax regret in unsupervised environment design: in partially observable settings with irreducible regret, MMR can cause learning to stagnate by overemphasising the highest-regret levels. It introduces Bayesian level-perfect Minimax Regret (BLP), a refinement that preserves MMR guarantees while progressively improving worst-case regret on non-highest-regret levels by conditioning on trajectories realizable under prior adversaries. The ReMiDi algorithm implements this iterative refinement, yielding policies that align with Perfect Bayesian reasoning and demonstrate continued learning in domains where standard MMR stalls. Across toy, MiniGrid maze, lever, and Brax robotics experiments, ReMiDi outperforms regret-based baselines, showing stronger generalisation and robustness in open-ended task spaces.

Abstract

In unsupervised environment design, reinforcement learning agents are trained on environment configurations (levels) generated by an adversary that maximises some objective. Regret is a commonly used objective that theoretically results in a minimax regret (MMR) policy with desirable robustness guarantees; in particular, the agent's maximum regret is bounded. However, once the agent reaches this regret bound on all levels, the adversary will only sample levels where regret cannot be further reduced. Although there are possible performance improvements to be made outside of these regret-maximising levels, learning stagnates. In this work, we introduce Bayesian level-perfect MMR (BLP), a refinement of the minimax regret objective that overcomes this limitation. We formally show that solving for this objective results in a subset of MMR policies, and that BLP policies act consistently with a Perfect Bayesian policy over all levels. We further introduce an algorithm, ReMiDi, that results in a BLP policy at convergence. We empirically demonstrate that training on levels from a minimax regret adversary causes learning to prematurely stagnate, but that ReMiDi continues learning.
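
For reference, the objective described in the abstract can be written in its standard form. The notation here is an assumption made for illustration and does not appear on this page: $V_\theta(\pi)$ denotes the expected return of policy $\pi$ on level $\theta$, and $\pi^*_\theta$ denotes an optimal policy for that level.

$$
\text{Regret}_\theta(\pi) = V_\theta(\pi^*_\theta) - V_\theta(\pi), \qquad \pi_{\text{MMR}} \in \arg\min_{\pi} \max_{\theta \in \Theta} \text{Regret}_\theta(\pi).
$$

The guarantee concerns only the maximum: every level satisfies $\text{Regret}_\theta(\pi_{\text{MMR}}) \le \max_{\theta' \in \Theta} \text{Regret}_{\theta'}(\pi_{\text{MMR}})$, and nothing stronger is implied for levels whose regret lies below that maximum.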

Paper Structure

This paper contains 38 sections, 5 theorems, 12 equations, 9 figures, 9 tables, 1 algorithm.

Key Result

Theorem 4.3

Suppose we have a UPOMDP with level space $\Theta$. Let $\pi$ be some policy and $\Theta' \subseteq \Theta$ be some subset of levels. Let $(\pi', \Lambda')$ denote a policy and adversary at Nash equilibrium for the refined minimax regret game under $\pi$ and $\Theta'$. Then, (a) for all $\theta \in

Figures (9)

  • Figure 1: An illustration of the regret stagnation problem of minimax regret that our work addresses. In the T-mazes, the reward for reaching the goal is $1.0$ and $-1.0$ for failing. The reward for the mazes is $0.9$ for reaching the goal, and zero otherwise. Regret-based UED methods gravitate towards sampling high regret environments (T-mazes in this case with a regret of $1.0$), even if the agent cannot improve on these levels. This is despite the existence of non-high-regret levels (the mazes, with regret upper-bounded by $0.9$) on which the agent can still improve.
  • Figure 2: The BLP solution concept iteratively restricts the sets of policies by altering behaviour only in certain trajectories. (a) Each node corresponds to a level-trajectory pair, and the root node indicates that the adversary samples some level $\theta$. MMR results in adversary $\textcolor{blue}{\Lambda_1}$ and policy $\textcolor{blue}{\pi_1}$, reaching the nodes in blue. MMR would terminate at step 1, but we instead refine our policy further. In step 2, we learn $\textcolor{red}{\Lambda_2}$ and $\textcolor{red}{\pi_2}$. Following $\textcolor{blue}{\pi_1}$ in $\textcolor{red}{\theta_3}$ leads to a trajectory that never happens under $\textcolor{blue}{\Lambda_1}$ and $\textcolor{blue}{\pi_1}$. In step 3, we fill in behaviour for $\textcolor{goodgreen}{\theta_4}$, as these trajectories are never reached under any of the previous MMR adversaries. We terminate after all environments have been sampled by an adversary. (b) Iterative refinement reduces the set of policies we consider, improving upon the initial MMR policy $\textcolor{blue}{\pi_1}$. (c) If we have a minimax regret adversary $\textcolor{blue}{\Lambda_1}$, we are only guaranteed that the regret on all other levels must be at or below the regret of levels in the support of $\textcolor{blue}{\Lambda_1}$ (indicated by the dashed blue line). Refining our policy improves the bound on all levels except those sampled by $\textcolor{blue}{\Lambda_1}$. We iterate this process until all levels have been sampled, monotonically improving the regret bound on all non-previously-sampled levels.
  • Figure 3: Plotting the regret of MMR UED throughout training for each of the $6$ trajectories. Here the optimal regret is $0$ for each trajectory, and minimax regret achieves this.
  • Figure 4: The greatest regret on each trajectory $\tau$ for (left) Standard Minimax Regret UED; and (right) ReMiDi. ReMiDi obtains optimal regret on all levels, whereas MMR does not.
  • Figure 5: The adversary's probability of sampling each environment $\theta$ for (left) Standard Minimax Regret UED; and (right) ReMiDi. MMR UED exclusively samples the irreducible-regret levels.
  • ...and 4 more figures
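
The stagnation described in the Figure 1 caption can be reproduced with a small calculation. The sketch below is a hypothetical toy model, not the authors' ReMiDi implementation. It takes the reward values from the caption and assumes that (i) the agent cannot tell the two T-mazes apart, so its behaviour there reduces to a single probability `p_left` of turning left, and (ii) its behaviour in the ordinary maze reduces to a success probability `q_solve`. It also simplifies the refinement step by conditioning on levels, whereas the paper conditions on trajectories realizable under previous adversaries.

```python
# Toy model of Figure 1 (assumed for illustration): two indistinguishable
# T-mazes plus one ordinary maze. Names p_left and q_solve are hypothetical.
GRID = [i / 10 for i in range(11)]  # candidate probabilities 0.0, 0.1, ..., 1.0


def regrets(p_left, q_solve):
    """Regret on each level: optimal return minus the policy's expected return."""
    return {
        # T-maze reward: +1.0 for reaching the goal, -1.0 for failing.
        "tmaze_left": 1.0 - (p_left * 1.0 + (1 - p_left) * -1.0),
        "tmaze_right": 1.0 - ((1 - p_left) * 1.0 + p_left * -1.0),
        # Maze reward: 0.9 for reaching the goal, zero otherwise.
        "maze": 0.9 - q_solve * 0.9,
    }


def max_regret(p_left, q_solve, levels=None):
    """Worst-case regret over the given levels (all levels if None)."""
    r = regrets(p_left, q_solve)
    return max(r[name] for name in (r if levels is None else levels))


# Step 1: plain minimax regret over the whole level space.
mmr_value = min(max_regret(p, q) for p in GRID for q in GRID)
mmr_policies = [
    (p, q) for p in GRID for q in GRID
    if abs(max_regret(p, q) - mmr_value) < 1e-9
]

# Levels the regret-maximising adversary would sample under an MMR policy.
p_star = 0.5
support = [
    name for name, r in regrets(p_star, 0.0).items()
    if abs(r - mmr_value) < 1e-9
]

# Step 2 (simplified refinement): keep the T-maze behaviour fixed and
# minimise worst-case regret on the levels outside the adversary's support.
remaining = [name for name in regrets(p_star, 0.0) if name not in support]
q_refined = min(GRID, key=lambda q: max_regret(p_star, q, remaining))

print("MMR value:", mmr_value)
print("Number of MMR policies on the grid:", len(mmr_policies))
print("Adversary support:", support)
print("Refined maze success probability:", q_refined)
```

In this model every value of `q_solve` attains the same worst-case regret, so the plain minimax regret objective places no pressure on maze performance. Only the second step, which ignores the levels already in the adversary's support, separates a policy that solves the maze from one that does not.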

Theorems & Definitions (11)

  • Definition 4.1
  • Definition 4.2
  • Theorem 4.3
  • proof
  • Theorem 4.4
  • Definition 4.5
  • Theorem 4.6
  • Theorem \ref{thrm:minimax_refinement_theorem}
  • proof
  • Theorem \ref{thrm:bayes_perf}
  • ...and 1 more