Coupled Local and Global World Models for Efficient First Order RL

Joseph Amigo; Rooholla Khorrambakht; Nicolas Mansard; Ludovic Righetti

TL;DR

Robotic RL often relies on simulators, which introduces sim-to-real gaps and makes pixel-level modeling difficult. The paper introduces a simulator-free first-order-gradient model-based RL (FoG-MBRL) framework that couples a high-fidelity global diffusion world model, used for forward rollouts, with a lightweight local latent RSSM, used for backward gradients, learned from real-world image data. Policy gradients are first-order: forward trajectories are decoupled from backward differentiation, and Jacobians are evaluated at the forward next states. The policy is trained on the DMO-SAPO objective $L^{\text{DMO-SAPO}}_\pi(\boldsymbol{\theta}) = \mathbb{E}_{\tau \sim \pi_\theta, f}\left[\sum_{h=1}^{H-1} \gamma^h \left( r(s_h,a_h) + \alpha \mathcal{H}_{\pi}[a_h|s_h] \right) + \gamma^H V^{\pi_\theta}_{\psi}(s_H) \right]$, with policy gradient $\nabla_\theta G(\theta) = \sum_{t=0}^{\infty} \gamma^t \left[ \frac{\partial r}{\partial s}\Big|_{(s_t,a_t)} \frac{d s_t}{d\theta} + \frac{\partial r}{\partial a}\Big|_{(s_t,a_t)} \frac{d a_t}{d\theta} \right]$. The approach, validated on real Push-T and ego-centric Push Cube tasks, delivers superior sample and time efficiency over PPO and demonstrates robust zero-shot transfer without hand-crafted simulators, highlighting the practical potential of learning inside data-driven world models for challenging vision-based manipulation.
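To make the objective concrete, the following minimal sketch evaluates its single-trajectory estimate: the discounted sum of reward plus entropy bonus for steps $h = 1, \dots, H-1$, plus the discounted terminal value. The function name `dmo_sapo_objective` and its argument layout are illustrative assumptions, not the paper's code; the per-step rewards, entropies and terminal value are assumed to be given as plain numbers.

```python
def dmo_sapo_objective(rewards, entropies, terminal_value, gamma, alpha):
    """Single-trajectory estimate of the entropy-regularized objective.

    rewards[i] and entropies[i] correspond to step h = i + 1 (h = 1..H-1).
    terminal_value stands in for the critic estimate V(s_H).
    All names here are hypothetical and used for illustration only.
    """
    if len(rewards) != len(entropies):
        raise ValueError("rewards and entropies must have the same length")

    horizon = len(rewards) + 1  # H
    total = 0.0
    for h, (reward, entropy) in enumerate(zip(rewards, entropies), start=1):
        # gamma^h * (r(s_h, a_h) + alpha * H[a_h | s_h])
        total += gamma ** h * (reward + alpha * entropy)

    # Bootstrap the tail of the return with gamma^H * V(s_H).
    return total + gamma ** horizon * terminal_value


if __name__ == "__main__":
    value = dmo_sapo_objective(
        rewards=[1.0, 1.0],
        entropies=[0.0, 0.0],
        terminal_value=2.0,
        gamma=0.5,
        alpha=0.1,
    )
    print(value)
```

In practice the expectation would be approximated by averaging this quantity over many rollouts; the sketch shows only the per-trajectory term.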

Abstract

World models offer a promising avenue for more faithfully capturing complex dynamics, including contacts and non-rigidity, as well as complex sensory information, such as visual perception, in situations where standard simulators struggle. However, these models are computationally expensive to evaluate, posing a challenge for popular RL approaches that have been used successfully with simulators to solve complex locomotion tasks but still struggle with manipulation. This paper introduces a method that bypasses simulators entirely, training RL policies inside world models learned from robots' interactions with real environments. At its core, our approach enables policy training with large-scale diffusion models via a novel decoupled first-order gradient (FoG) method: a full-scale world model generates accurate forward trajectories, while a lightweight latent-space surrogate approximates its local dynamics for efficient gradient computation. This coupling of a local and global world model ensures high-fidelity unrolling alongside computationally tractable differentiation. We demonstrate the efficacy of our method on the Push-T manipulation task, where it significantly outperforms PPO in sample efficiency. We further evaluate our approach through an ego-centric object manipulation task with a quadruped. Together, these results demonstrate that learning inside data-driven world models is a promising pathway for solving hard-to-model RL tasks in image space without reliance on hand-crafted physics simulators.
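The decoupling idea can be illustrated with a toy scalar example, given here as a sketch under strong assumptions rather than the paper's implementation. A hypothetical "global" model `global_step` produces the forward states and is never differentiated, while a hypothetical "local" surrogate supplies only the Jacobians, evaluated at the states the global rollout visits. The dynamics, the reward $r = -s^2 - 0.01a^2$ and the linear policy $a = \theta s$ are all invented for illustration; in the paper the global model is a diffusion model in pixel space and the local model is a latent RSSM.

```python
import math


def global_step(state, action):
    # Hypothetical "global" model: used for the forward rollout only.
    return state + 0.1 * math.sin(action)


def local_jacobians(state, action):
    # Hypothetical "local" surrogate s' ~ s + 0.1 * a.
    # Returns (ds'/ds, ds'/da); constant here because the surrogate is linear.
    return 1.0, 0.1


def reward_grads(state, action):
    # Toy reward r = -s^2 - 0.01 * a^2; returns (dr/ds, dr/da).
    return -2.0 * state, -0.02 * action


def decoupled_policy_gradient(theta, initial_state, horizon, gamma):
    """First-order gradient of the discounted return w.r.t. theta.

    Forward states come from global_step; sensitivities ds_t/dtheta are
    propagated with the local surrogate's Jacobians at those states.
    """
    state = initial_state
    ds_dtheta = 0.0  # the initial state does not depend on theta
    grad = 0.0
    for t in range(horizon):
        action = theta * state  # toy linear policy
        # Total derivative of the action with respect to theta.
        da_dtheta = state + theta * ds_dtheta
        dr_ds, dr_da = reward_grads(state, action)
        grad += gamma ** t * (dr_ds * ds_dtheta + dr_da * da_dtheta)
        # Backward information: local Jacobians at the forward state.
        jac_s, jac_a = local_jacobians(state, action)
        ds_dtheta = jac_s * ds_dtheta + jac_a * da_dtheta
        # Forward information: next state from the global model.
        state = global_step(state, action)
    return grad


if __name__ == "__main__":
    print(decoupled_policy_gradient(theta=0.0, initial_state=1.0, horizon=2, gamma=0.9))
```

Each term accumulated in `grad` has the form of the summand in the policy gradient expression from the TL;DR, truncated to a finite horizon. The quality of the resulting gradient depends on how well the local surrogate's Jacobians match those of the global model along the rollout.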

Paper Structure (25 sections, 6 equations, 7 figures, 2 tables, 1 algorithm)

Introduction
Related Work
World Models and Policy Learning.
Gradient Estimation for Policy Learning.
Diffusion for Efficient Control.
Method
Background
Notations
Model-based RL with first order gradients
Decoupling
Offline Data Collection
Forward Models (Global)
Dynamics Models
Reward Models
Backward Models (Local)
...and 10 more sections

Figures (7)

Figure 1: Overview of the proposed approach. Global world/reward models are learned from play/demonstration data (Steps 1 and 2). Then, local world/reward models are pre-trained (Step 3). Policy is optimized with the DMO algorithm (Step 4): forward simulation uses the diffusion-based global world model in pixel space, trained on real robot data, to produce high-fidelity rollouts. The backward pass uses gradients computed from a local world model operating in a low-dimensional latent space. The local world/reward model is further fine-tuned during policy optimization.
Figure 2: (a) Unrolling of real and model-predicted trajectories comparing the real trajectory (top), the DIAMOND diffusion trajectory (middle), and the DreamerV3 trajectory (bottom) at time steps $0, 5, 10, \dots, 60$ ($12$ s). (b) Unrolling of real and model-predicted trajectories comparing the real trajectory (top), the DreamerV3 trajectory (middle), and the DreamerV4 diffusion trajectory (bottom) at time steps $0, 5, 10, \dots, 60$ ($12$ s), with both models initialized with zero context, and the cube initially occluded. Note that the local model violates object permanence by spawning the cube throughout the rollout.
Figure 3: (a) Efficiency comparison on the Push-T task: sample efficiency of DMO (8M samples) versus PPO (40M samples) and the No Diffusion ablation (left), and corresponding time efficiency comparison (right). (b) Efficiency comparison on the Push Cube task: sample efficiency of DMO (4M samples) versus PPO (25M samples) and the No Diffusion ablation (left), and time efficiency comparison between DMO, PPO and No Diffusion (right).
Figure 4: Three real-robot Push-T trajectories executed by the policy learned with our approach.
Figure 5: Task completion comparison. Top: The Behavior Cloning (ACT) policy successfully approaches the cube but stops pushing before the object enters the goal, reflecting the sub-optimal demonstration distribution. Bottom: Our RL Policy (DMO) generalizes beyond the demonstrations, learning to push the cube fully into the net to maximize the task reward.
...and 2 more figures
