Reward-Free Curricula for Training Robust World Models

Marc Rigter; Minqi Jiang; Ingmar Posner

TL;DR

This paper addresses the training of robust world models under reward-free exploration. It formulates the Reward-Free Minimax Regret problem and derives an upper bound that links regret to the maximum latent-dynamics error across environment instantiations. This theoretical insight motivates WAKER, an active curriculum that selects environment settings using ensemble disagreement to minimise the worst-case latent-transition error, with data gathered both from real environments and imagined rollouts in a DreamerV2-based world model. The approach is instantiated in pixel-based domains and evaluated against baselines such as Domain Randomisation and oracle curricula, demonstrating improved robustness and better out-of-distribution generalisation without requiring rewards during exploration. The work provides a principled, reward-free pathway to scaling robust world-model pretraining and zero-shot task adaptation in diverse, unseen environments.
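As a rough illustration of how ensemble disagreement can act as a proxy for latent-transition error, the sketch below scores each environment by the variance of next-latent predictions across an ensemble and picks the worst case. The function names, array shapes, and the use of plain variance are assumptions made for this example; they are not taken from the paper's implementation.

```python
import numpy as np


def ensemble_disagreement(predictions: np.ndarray) -> float:
    """Mean variance across ensemble members.

    predictions has the hypothetical shape (n_models, n_steps, latent_dim):
    each ensemble member's next-latent prediction at every step of a rollout.
    """
    # Variance over the ensemble axis, one value per step and latent dimension.
    variance = predictions.var(axis=0)
    # Average over steps and latent dimensions to get one score per rollout.
    return float(variance.mean())


def worst_case_environment(per_env_predictions: dict) -> tuple:
    """Return the environment name with the largest disagreement, plus all scores."""
    scores = {
        name: ensemble_disagreement(preds)
        for name, preds in per_env_predictions.items()
    }
    worst = max(scores, key=scores.get)
    return worst, scores
```

A higher score means the ensemble members disagree more about the dynamics of that environment, which is read here as a sign that the world model is less accurate there.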

Abstract

There has been a recent surge of interest in developing generally-capable agents that can adapt to new tasks without additional training in the environment. Learning world models from reward-free exploration is a promising approach, and enables policies to be trained using imagined experience for new tasks. However, achieving a general agent requires robustness across different environments. In this work, we address the novel problem of generating curricula in the reward-free setting to train robust world models. We consider robustness in terms of minimax regret over all environment instantiations and show that the minimax regret can be connected to minimising the maximum error in the world model across environment instances. This result informs our algorithm, WAKER: Weighted Acquisition of Knowledge across Environments for Robustness. WAKER selects environments for data collection based on the estimated error of the world model for each environment. Our experiments demonstrate that WAKER outperforms several baselines, resulting in improved robustness, efficiency, and generalisation.
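The environment-selection step described in the abstract could be sketched as follows. This is a minimal illustration, assuming a softmax over standardised per-environment error estimates mixed with occasional uniform sampling; the names `selection_probs`, `choose_environment`, `p_uniform`, and `temperature` are hypothetical and do not come from the paper.

```python
import numpy as np


def selection_probs(error_estimates, temperature: float = 1.0) -> np.ndarray:
    """Softmax over standardised error estimates (an assumed weighting scheme)."""
    errs = np.asarray(error_estimates, dtype=float)
    scale = errs.std()
    # Standardise so the temperature behaves consistently across error scales.
    z = (errs - errs.mean()) / scale if scale > 0 else np.zeros_like(errs)
    logits = z / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def choose_environment(error_estimates, rng, p_uniform: float = 0.2,
                       temperature: float = 1.0) -> int:
    """Pick the index of the next environment to collect data from."""
    n = len(error_estimates)
    # With probability p_uniform, fall back to uniform sampling so that
    # every environment keeps being visited.
    if rng.random() < p_uniform:
        return int(rng.integers(n))
    probs = selection_probs(error_estimates, temperature)
    return int(rng.choice(n, p=probs))
```

For example, `choose_environment([0.1, 0.5, 0.9], np.random.default_rng(0))` returns an index in `{0, 1, 2}`, with the third environment the most likely choice because its estimated error is largest.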

Paper Structure (40 sections, 3 theorems, 25 equations, 23 figures, 6 tables, 1 algorithm)

Introduction
Preliminaries
Approach
World Models for Underspecified POMDPs
Reward-Free Minimax Regret: Problem Definition
Theoretical Motivation
Weighted Acquisition of Knowledge across Environments for Robustness
Experiments
Related Work
Conclusion
Acknowledgements
Proof of Proposition 1
Key Additional Results
Training a Single World Model for Two Domains
Out of Distribution Evaluation: Full Task Results
...and 25 more sections

Key Result

Proposition 1

Let $\widehat{T}$ be the learnt latent dynamics in the world model. Assume the existence of a representation model $q$ that adheres to Assumption ass:q, and let $T$ be the true latent dynamics according to Assumption ass:q. Then, for any parameter setting $\theta$ and reward function $R$, the regret is bounded above by a term that depends on the latent-dynamics error [...], where $d(\pi, \mathcal{M})$ denotes the state-action distribution of $\pi$ in MDP $\mathcal{M}$ [...]
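For orientation, the regret and minimax-regret objective that the proposition refers to can be written in the standard form below. The notation is an assumption made for illustration and is not copied from the paper: $V(\pi, \mathcal{M})$ stands for the expected return of policy $\pi$ in $\mathcal{M}$, $\mathcal{P}_\theta^R$ for the environment with parameter setting $\theta$ paired with reward function $R$, and $\Theta$ for the set of parameter settings.

```latex
% Regret of policy \pi in environment \theta under reward R (assumed notation):
\mathrm{Regret}(\pi, \theta, R)
  = \max_{\pi'} V\big(\pi', \mathcal{P}_\theta^R\big) - V\big(\pi, \mathcal{P}_\theta^R\big)

% A minimax-regret policy minimises the worst-case regret over environments:
\pi_{\mathrm{minimax}} \in \arg\min_{\pi} \max_{\theta \in \Theta}
  \mathrm{Regret}(\pi, \theta, R)
```

Under this reading, a bound on regret that holds for every $\theta$ turns the minimax objective into one of reducing the largest world-model error across environments, which is the connection the TL;DR describes.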

Figures (23)

Figure 1: a) WAKER uses error estimates for each environment to choose the next environment to sample data from, $\mathcal{P}_\theta$. A trajectory $\tau_\theta$ is collected by rolling out exploration policy $\pi^{\textnormal{expl}}$ in the selected environment. $\tau_\theta$ is added to the data buffer $D$ which is used to train the world model, $W$. Imagined trajectories in $W$ are used to update the error estimates. b) In the world model, each environment is encoded to a subset, $Z_\theta$, of the latent space $Z$ by representation model $q$.
Figure 1: Robustness evaluation: CVaR$_{0.1}$ of policies evaluated on 100 randomly sampled environments.
Figure 2: Example training environments. Rows: Terrain Walker, Terrain Hopper, Clean Up, Car Clean Up.
Figure 2: Out-of-distribution evaluation: average performance on OOD environments. Here, we present the average performance across tasks for each domain. Full results for each task are in the appendix section "Out of Distribution Evaluation: Full Task Results".
Figure 3: Robustness evaluation aggregated CIs.
...and 18 more figures

Theorems & Definitions (3)

Proposition 1
Lemma 1: Simulation Lemma (Kearns and Singh, 2002)
Proposition 1
