Coupled Local and Global World Models for Efficient First Order RL
Joseph Amigo, Rooholla Khorrambakht, Nicolas Mansard, Ludovic Righetti
TL;DR
Robotic RL often relies on simulators, which creates sim-to-real gaps and pixel-level modeling challenges. The paper introduces a simulator-free first-order gradient model-based RL (FoG-MBRL) framework, learned from real-world image data, that couples a high-fidelity global diffusion world model for forward rollouts with a lightweight local latent RSSM for backward gradients. Forward trajectories are decoupled from backward differentiation: the rollout comes from the global model, and the Jacobians used for first-order policy gradients are evaluated at the next states produced by that forward rollout. Training is formalized by the DMO-SAPO objective $L^{\text{DMO-SAPO}}_\pi(\theta) = \mathbb{E}_{\tau \sim \pi_\theta, f}[\sum_{h=1}^{H-1} \gamma^h ( r(s_h,a_h) + \alpha \mathcal{H}_{\pi}[a_h|s_h] ) + \gamma^H V^{\pi_\theta}_{\psi}(s_H) ]$, with the policy gradient $\nabla_\theta G(\theta) = \sum_{t=0}^{\infty} \gamma^t [ \frac{\partial r}{\partial s}|_{(s_t,a_t)} \frac{d s_t}{d\theta} + \frac{\partial r}{\partial a}|_{(s_t,a_t)} \frac{d a_t}{d\theta} ]$. Validated on real Push-T and ego-centric Push Cube tasks, the approach is more sample- and time-efficient than PPO and transfers zero-shot without hand-crafted simulators, highlighting the practical potential of learning inside data-driven world models for challenging vision-based manipulation.
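To make the indexing of the DMO-SAPO objective concrete, the following is a minimal sketch of a single-trajectory estimate of the bracketed term, assuming the per-step rewards, policy entropies, and the terminal value have already been computed as plain floats. The function name `sapo_style_objective` and its arguments are hypothetical and not taken from the paper; the expectation over trajectories would be approximated by averaging this quantity over rollouts.

```python
def sapo_style_objective(rewards, entropies, terminal_value, gamma=0.99, alpha=0.1):
    """Single-trajectory estimate of the entropy-regularized H-step objective.

    rewards[i] and entropies[i] hold r(s_h, a_h) and H_pi[a_h | s_h] for
    h = 1, ..., H-1 (so i = h - 1); terminal_value stands for V(s_H).
    All names here are illustrative assumptions, not the paper's API.
    """
    if len(rewards) != len(entropies):
        raise ValueError("rewards and entropies must have the same length")

    horizon = len(rewards) + 1  # H: the sum runs over h = 1 .. H-1
    total = 0.0
    for h, (r, ent) in enumerate(zip(rewards, entropies), start=1):
        # Discounted reward plus the alpha-weighted entropy bonus at step h.
        total += gamma ** h * (r + alpha * ent)

    # Bootstrap the tail of the return with the value estimate at s_H.
    total += gamma ** horizon * terminal_value
    return total


# Example call with made-up numbers (H = 3).
print(sapo_style_objective([1.0, 2.0], [0.5, 0.5], terminal_value=4.0))
```

Setting `alpha=0` recovers a plain discounted return with a terminal value bootstrap, which is a convenient sanity check when wiring up the entropy term.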
Abstract
World models offer a promising avenue for more faithfully capturing complex dynamics, including contacts and non-rigidity, as well as complex sensory information, such as visual perception, in situations where standard simulators struggle. However, these models are computationally expensive to evaluate, posing a challenge for popular RL approaches that have been used successfully with simulators to solve complex locomotion tasks yet still struggle with manipulation. This paper introduces a method that bypasses simulators entirely, training RL policies inside world models learned from robots' interactions with real environments. At its core, our approach enables policy training with large-scale diffusion models via a novel decoupled first-order gradient (FoG) method: a full-scale world model generates accurate forward trajectories, while a lightweight latent-space surrogate approximates its local dynamics for efficient gradient computation. This coupling of a local and a global world model ensures high-fidelity unrolling alongside computationally tractable differentiation. We demonstrate the efficacy of our method on the Push-T manipulation task, where it significantly outperforms PPO in sample efficiency. We further evaluate our approach on an ego-centric object manipulation task with a quadruped. Together, these results demonstrate that learning inside data-driven world models is a promising pathway for solving hard-to-model RL tasks in image space without reliance on hand-crafted physics simulators.
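The decoupling of forward unrolling from backward differentiation can be illustrated on a toy scalar system. The sketch below is not the paper's implementation: the "global" model is a hand-written nonlinear function standing in for the diffusion world model, the "local" model is a fixed linear surrogate standing in for the latent RSSM, and the policy `a = theta * s`, the quadratic reward, and all function names are assumptions made for illustration. States are advanced only by the global model, while the sensitivity `ds/dtheta` is propagated using only the surrogate's Jacobians evaluated along that forward trajectory, and the infinite sum in the policy gradient is truncated at a finite horizon.

```python
import math


def reward(s, a):
    # Toy quadratic reward (an assumption, not the paper's reward).
    return -(s * s + 0.1 * a * a)


def global_step(s, a):
    # Stand-in for the expensive global world model: used for the forward pass only.
    return s + 0.1 * a + 0.05 * math.sin(s)


def local_jacobians(s, a):
    # Stand-in for the lightweight local surrogate: returns (d s'/d s, d s'/d a).
    # Here it is a fixed linear approximation that ignores the sin term.
    return 1.0, 0.1


def decoupled_policy_gradient(theta, s0, step_fn, jac_fn, horizon, gamma=0.99):
    """Return (discounted return, d return / d theta) for the policy a = theta * s."""
    s, ds_dtheta = s0, 0.0
    ret, grad = 0.0, 0.0
    for t in range(horizon):
        a = theta * s
        # Total derivative of the action: direct dependence on theta plus via s.
        da_dtheta = s + theta * ds_dtheta
        dr_ds, dr_da = -2.0 * s, -0.2 * a
        ret += gamma ** t * reward(s, a)
        grad += gamma ** t * (dr_ds * ds_dtheta + dr_da * da_dtheta)

        # Forward: the next state comes from the global model.
        s_next = step_fn(s, a)
        # Backward: sensitivities are propagated with the surrogate's Jacobians.
        dg_ds, dg_da = jac_fn(s, a)
        ds_dtheta = dg_ds * ds_dtheta + dg_da * da_dtheta
        s = s_next
    return ret, grad


ret, grad = decoupled_policy_gradient(-0.5, 1.0, global_step, local_jacobians, horizon=10)
print(ret, grad)
```

When the surrogate's Jacobians match the forward model exactly, this recursion reduces to ordinary backpropagation through time; the gradient becomes biased to the extent that the local surrogate misrepresents the global model's sensitivities.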
