PWM: Policy Learning with Multi-Task World Models

Ignat Georgiev; Varun Giridhar; Nicklas Hansen; Animesh Garg

TL;DR

This work tackles multi-task reinforcement learning across diverse embodiments with PWM, a model-based approach that pretrains a large, smooth world model on offline data and then extracts a policy for each task by first-order gradient optimization, using the learned world model as a differentiable simulator. Because PWM prioritizes world-model regularization over accuracy, it obtains smoother optimization landscapes, which make policy learning efficient even in high-dimensional tasks (up to $m=152$ action dimensions) and in multi-task settings (MT30 and MT80) without online planning. PWM outperforms baselines that use ground-truth dynamics, achieves up to 27% higher rewards than existing baselines in the 80-task setting, and learns each per-task policy in under 10 minutes. These results point to differentiable world models as a practical source of low-variance, informative policy gradients for fast and scalable robotic control. The work also discusses limitations related to data requirements and per-task retraining, and it releases code and data to support further research.

Abstract

Reinforcement Learning (RL) has made significant strides in complex tasks but struggles in multi-task settings with different embodiments. World model methods offer scalability by learning a simulation of the environment but often rely on inefficient gradient-free optimization methods for policy extraction. In contrast, gradient-based methods exhibit lower variance but fail to handle discontinuities. Our work reveals that well-regularized world models can generate smoother optimization landscapes than the actual dynamics, facilitating more effective first-order optimization. We introduce Policy learning with multi-task World Models (PWM), a novel model-based RL algorithm for continuous control. Initially, the world model is pre-trained on offline data, and then policies are extracted from it using first-order optimization in less than 10 minutes per task. PWM effectively solves tasks with up to 152 action dimensions and outperforms methods that use ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without relying on costly online planning. Visualizations and code are available at https://www.imgeorgiev.com/pwm/.
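The claim that a smooth model can be easier to optimize than the true dynamics can be illustrated on a toy problem. The sketch below is not the paper's method or its ball-wall example: the 1-D objective, the branch functions `left` and `right`, and the use of Gaussian smoothing as a stand-in for a regularized world model are all assumptions made for illustration. It runs the same gradient descent on the piecewise gradient of a discontinuous objective and on the gradient of its smoothed version, then compares optimality gaps on the true objective.

```python
import numpy as np

# Hypothetical 1-D objective with a jump at theta = 0 (illustrative only).
def left(x):   # branch for x < 0: shallow local minimum at x = -1, J = 1
    return 1.0 + 0.05 * (x + 1.0) ** 2

def right(x):  # branch for x >= 0: global minimum at x = 1, J = -1
    return (x - 1.0) ** 2 - 1.0

def true_J(x):
    return np.where(x < 0.0, left(x), right(x))

def true_grad(x):
    # Piecewise derivative, as automatic differentiation would return it.
    # It carries no information about the jump at x = 0.
    return np.where(x < 0.0, 0.1 * (x + 1.0), 2.0 * (x - 1.0))

# Assumption: Gaussian smoothing stands in for a well-regularized model.
SIGMA = 1.0
EPS = np.linspace(-5.0 * SIGMA, 5.0 * SIGMA, 4001)  # quadrature grid
W = np.exp(-0.5 * (EPS / SIGMA) ** 2)
W /= W.sum()                                        # normalized weights

def smooth_grad(theta):
    # d/dtheta E[J(theta + eps)] = E[J'(theta + eps)] + jump * pdf(0 - theta)
    interior = np.sum(W * true_grad(theta + EPS))
    pdf = np.exp(-0.5 * (theta / SIGMA) ** 2) / (SIGMA * np.sqrt(2.0 * np.pi))
    jump = (right(0.0) - left(0.0)) * pdf
    return interior + jump

def gradient_descent(grad_fn, theta0, lr=0.3, steps=500):
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return float(theta)

J_STAR = -1.0  # global minimum value of true_J, attained at theta = 1
theta_true = gradient_descent(true_grad, -3.0)
theta_smooth = gradient_descent(smooth_grad, -3.0)
gap_true = abs(J_STAR - float(true_J(theta_true)))
gap_smooth = abs(J_STAR - float(true_J(theta_smooth)))
print(f"true gradients:     theta={theta_true:.3f}, gap={gap_true:.3f}")
print(f"smoothed gradients: theta={theta_smooth:.3f}, gap={gap_smooth:.3f}")
```

In this toy setting, descent on the true piecewise gradient stalls at the shallow local minimum to the left of the jump, while descent on the smoothed gradient crosses the discontinuity and ends close to the global minimum, at the cost of a small bias introduced by the smoothing.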

Paper Structure (21 sections, 14 equations, 20 figures, 4 tables, 1 algorithm)

Introduction
Background
Policy optimization through world models
Learning through contact
Learning with chaotic dynamics
PWM: Policy learning with multi-task World Models
Experimental results
Contact-rich single tasks
Multi-task world-model
Ablations
Related work
Conclusion
Ball-wall example details
Double pendulum example details
Implementation details and hyper-parameters
...and 6 more sections

Figures (20)

Figure 1: We propose PWM, a new method for multi-task RL that utilizes pre-trained world models to learn policies for each task. When sufficiently regularized, these world models induce smooth optimization landscapes, which allows for efficient first-order optimization. Our approach can solve tasks in <10 minutes and achieves higher rewards in both single-task and multi-task environments.
Figure 2: Ball-wall pedagogical example. The left figure visualizes the problem. The middle figure shows the problem landscape induced by each model: $J(\theta)$ is the true underlying function and the other two are MLPs with different activation functions. We minimize each of these problems using gradient descent, starting at $\theta=-\pi$ (marker $\times$). The colored crosses mark the solution that each model converges to. The right table shows the model approximation error during training and the optimality gap $| J(\theta^*) - J(\hat{\theta}) |$ between the global minimum $\theta^*$ and the solution $\hat{\theta}$ found for each model.
Figure 3: Double pendulum pedagogical example. The middle figure evaluates the variance of policy gradient estimates over $N=100$ Monte-Carlo samples for varying horizons $H$. The right figure shows the same data but plots the Expected Signal-to-Noise Ratio (ESNR), where higher values indicate more useful gradients. These results suggest that world models trained over long-horizon trajectories provide more useful gradients. Note that $H=3$ and $H=16$ in the figure legends refer to the training horizon of the models.
Figure 4: High-dimensional single-task environments (left to right): Hopper, Ant, Anymal, Humanoid and SNU Humanoid. Our method successfully learns tasks with up to $m=152$ continuous action dimensions. The additional 80 multi-task environments used in this paper are listed in Appendix \ref{app:multi-task-app}.
Figure 5: Aggregate results from high-dimensional locomotion tasks where each agent is trained to solve just that task (i.e., as a specialist). The left figure summarizes rewards achieved at the end of training, using 50% IQM with 95% CI for the solid lines, as suggested by Agarwal et al. (2021), and the mean for the dashed lines. PWM achieves higher rewards than our main baselines, TD-MPC2 and SHAC. The right figure shows score distributions across all tasks, which reveal the performance variability of each approach.
...and 15 more figures
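
The two quantities plotted in Figure 3, gradient variance and a signal-to-noise ratio, can be estimated from a batch of sampled policy gradients. The sketch below is a minimal illustration, not the paper's implementation: the caption does not give the ESNR formula, so the ratio used here (sum of squared mean gradients over sum of per-parameter variances) is an assumed, commonly used definition, and the function names and sample values are hypothetical.

```python
import numpy as np

def gradient_variance(grads):
    """Total variance of gradient samples, summed over parameters.

    grads: array of shape (N, D), N Monte-Carlo gradient samples of dimension D.
    """
    return float(np.var(grads, axis=0).sum())

def signal_to_noise(grads, eps=1e-12):
    """Assumed SNR definition: sum_i E[g_i]^2 / sum_i Var[g_i]."""
    mean_sq = float((np.mean(grads, axis=0) ** 2).sum())
    return mean_sq / (gradient_variance(grads) + eps)

# Hypothetical batch of N=2 gradient samples for D=2 parameters.
samples = np.array([[1.0, 0.0],
                    [3.0, 0.0]])
variance = gradient_variance(samples)  # per-parameter variances are [1, 0]
snr = signal_to_noise(samples)         # mean is [2, 0], so the ratio is 4 / 1
print(variance, snr)
```

Averaging this ratio over several independent batches would give an ESNR-style estimate; a higher value means the mean gradient direction stands out more clearly from sampling noise.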
