PWM: Policy Learning with Multi-Task World Models
Ignat Georgiev, Varun Giridhar, Nicklas Hansen, Animesh Garg
TL;DR
This work tackles multi-task reinforcement learning across diverse embodiments with PWM, a model-based approach that pretrains a large, smooth world model on offline data and then extracts per-task policies via first-order gradient optimization through the learned model. By prioritizing world-model regularization over accuracy, PWM obtains smoother optimization landscapes, enabling efficient policy learning in high-dimensional tasks (up to 152 action dimensions) and in multi-task settings (MT30 and MT80) without online planning. It outperforms baselines that use ground-truth dynamics or rely on online planning, with up to 27% higher reward in the 80-task setting, and learns each per-task policy in under 10 minutes. These results point to differentiable world models as a practical source of low-variance, informative policy gradients for fast, scalable robotic control. The work also discusses limitations related to data requirements and per-task retraining, and it releases code and data to support further research.
Abstract
Reinforcement Learning (RL) has made significant strides in complex tasks but struggles in multi-task settings with different embodiments. World model methods offer scalability by learning a simulation of the environment but often rely on inefficient gradient-free optimization methods for policy extraction. In contrast, gradient-based methods exhibit lower variance but fail to handle discontinuities. Our work reveals that well-regularized world models can generate smoother optimization landscapes than the actual dynamics, facilitating more effective first-order optimization. We introduce Policy learning with multi-task World Models (PWM), a novel model-based RL algorithm for continuous control. Initially, the world model is pre-trained on offline data, and then policies are extracted from it using first-order optimization in less than 10 minutes per task. PWM effectively solves tasks with up to 152 action dimensions and outperforms methods that use ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without relying on costly online planning. Visualizations and code are available at https://www.imgeorgiev.com/pwm/.
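The claim that a well-regularized model can be easier to optimize than the true dynamics can be illustrated with a toy one-dimensional sketch. This is not the PWM implementation: the step-shaped objective, the sigmoid surrogate standing in for a smooth learned model, and every name and constant below (`true_reward`, `model_reward`, `sharpness`, the learning rate) are assumptions chosen only for illustration.

```python
import math

def true_reward(a):
    """Hypothetical discontinuous objective: reward appears only past a threshold at a = 1."""
    return 1.0 if a > 1.0 else 0.0

def true_grad(a, eps=1e-4):
    """Central finite-difference gradient of the discontinuous objective.
    It is zero everywhere except in a tiny window around the jump."""
    return (true_reward(a + eps) - true_reward(a - eps)) / (2.0 * eps)

def model_reward(a, sharpness=2.0):
    """Assumed smooth surrogate (a sigmoid) standing in for a regularized learned model."""
    return 1.0 / (1.0 + math.exp(-sharpness * (a - 1.0)))

def model_grad(a, sharpness=2.0):
    """Analytic derivative of the sigmoid surrogate; nonzero for every action."""
    s = model_reward(a, sharpness)
    return sharpness * s * (1.0 - s)

def gradient_ascent(grad_fn, a0=0.0, lr=0.5, steps=200):
    """Plain first-order ascent on a scalar action."""
    a = a0
    for _ in range(steps):
        a += lr * grad_fn(a)
    return a

a_true = gradient_ascent(true_grad)    # never moves: the gradient at a0 is zero
a_model = gradient_ascent(model_grad)  # follows the smooth gradient past the threshold

print("true-dynamics gradient:", a_true, "-> reward", true_reward(a_true))
print("smooth-model gradient: ", a_model, "-> reward", true_reward(a_model))
```

In this toy setting the surrogate is a less accurate description of the objective than the step function itself, yet only its gradient carries a usable signal, which mirrors the trade-off between model accuracy and smoothness described above.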
