Reward-Free Curricula for Training Robust World Models
Marc Rigter, Minqi Jiang, Ingmar Posner
TL;DR
This paper addresses the training of robust world models under reward-free exploration. It formulates the problem as Reward-Free Minimax Regret and derives an upper bound that links regret to the maximum latent-dynamics error across environment instantiations. This result motivates WAKER, an active curriculum that selects environment settings using ensemble disagreement as an estimate of world-model error, with the aim of minimising the worst-case latent-transition error; data are gathered both from real environments and from imagined rollouts in a DreamerV2-based world model. The approach is instantiated in pixel-based domains and evaluated against baselines such as Domain Randomisation and oracle curricula, showing improved robustness and better out-of-distribution generalisation without requiring rewards during exploration. The work offers a principled, reward-free route to scaling robust world-model pretraining and zero-shot task adaptation in diverse, unseen environments.
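For orientation, the minimax-regret objective mentioned above can be written in generic notation. This is a sketch under assumed symbols, not the paper's own notation: $\Theta$ is the set of environment instantiations, $\mathcal{R}$ the set of reward functions that may be specified after exploration, $W$ the learned world model, and $\pi_{W,R}$ the policy obtained by optimising reward $R$ inside $W$.

```latex
% Regret of a policy \pi in environment \theta under reward function R:
% the gap between the optimal value and the value achieved by \pi.
\[
  \mathrm{Regret}(\pi, \theta, R) \;=\; V^{*}(\theta, R) \;-\; V^{\pi}(\theta, R)
\]

% Reward-free minimax regret (assumed form): pick the world model whose
% induced policies have the smallest worst-case regret over all
% environments and all reward functions.
\[
  W^{\star} \;\in\; \arg\min_{W} \; \max_{\theta \in \Theta,\; R \in \mathcal{R}}
  \mathrm{Regret}\bigl(\pi_{W,R},\, \theta,\, R\bigr)
\]
```

Because $R$ is unknown during exploration, this objective cannot be optimised directly, which is why a bound in terms of world-model error across environments is useful.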
Abstract
There has been a recent surge of interest in developing generally-capable agents that can adapt to new tasks without additional training in the environment. Learning world models from reward-free exploration is a promising approach, and enables policies to be trained using imagined experience for new tasks. However, achieving a general agent requires robustness across different environments. In this work, we address the novel problem of generating curricula in the reward-free setting to train robust world models. We consider robustness in terms of minimax regret over all environment instantiations and show that the minimax regret can be connected to minimising the maximum error in the world model across environment instances. This result informs our algorithm, WAKER: Weighted Acquisition of Knowledge across Environments for Robustness. WAKER selects environments for data collection based on the estimated error of the world model for each environment. Our experiments demonstrate that WAKER outperforms several baselines, resulting in improved robustness, efficiency, and generalisation.
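The selection step described above, choosing environments according to the estimated world-model error, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the use of ensemble variance as the error estimate, the Boltzmann (softmax) weighting, and the occasional uniform sampling controlled by `p_uniform` are all assumptions made for illustration.

```python
import numpy as np


def ensemble_disagreement(predictions):
    """Estimate world-model error for one environment (assumed proxy).

    predictions: array of shape (n_models, n_samples, latent_dim) holding
    each ensemble member's predicted next latent state.
    Returns the variance across members, averaged over samples and dims.
    """
    preds = np.asarray(predictions, dtype=float)
    return float(preds.var(axis=0).mean())


def environment_sampling_probs(error_estimates, temperature=1.0):
    """Turn per-environment error estimates into sampling probabilities."""
    errs = np.asarray(error_estimates, dtype=float)
    spread = errs.std()
    # Standardise so the temperature has a comparable effect at any scale.
    if spread > 0:
        scaled = (errs - errs.mean()) / spread
    else:
        scaled = np.zeros_like(errs)
    logits = scaled / temperature
    logits = logits - logits.max()  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def select_environment(error_estimates, rng, temperature=1.0, p_uniform=0.2):
    """Pick the index of the environment to collect data from next.

    With probability p_uniform an environment is drawn uniformly, so every
    setting keeps being visited; otherwise higher-error settings are favoured.
    """
    n_envs = len(error_estimates)
    if rng.random() < p_uniform:
        return int(rng.integers(n_envs))
    probs = environment_sampling_probs(error_estimates, temperature)
    return int(rng.choice(n_envs, p=probs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical ensembles: larger noise scale means more disagreement.
    noise_scales = [0.1, 0.5, 1.0]
    errors = [
        ensemble_disagreement(rng.normal(scale=s, size=(5, 32, 8)))
        for s in noise_scales
    ]
    print("error estimates:", errors)
    print("sampling probabilities:", environment_sampling_probs(errors))
    print("selected environment index:", select_environment(errors, rng))
```

A lower `temperature` concentrates data collection on the settings where the model is currently worst, while `p_uniform` keeps some coverage of the remaining settings so their error estimates do not go stale.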
