World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation

Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, Dongbin Zhao

Abstract

Robotic manipulation policies are commonly initialized through imitation learning, but their performance is limited by the scarcity and narrow coverage of expert data. Reinforcement learning can refine policies to alleviate this limitation, yet real-robot training is costly and unsafe, while training in simulators suffers from the sim-to-real gap. Recent advances in generative models have demonstrated remarkable capabilities in real-world simulation, with diffusion models in particular excelling at generation. This raises the question of how diffusion model-based world models can be combined to enhance pre-trained policies in robotic manipulation. In this work, we propose World4RL, a framework that employs diffusion-based world models as high-fidelity simulators to refine pre-trained policies entirely in imagined environments for robotic manipulation. Unlike prior works that primarily employ world models for planning, our framework enables direct end-to-end policy optimization. World4RL is designed around two principles: pre-training a diffusion world model that captures diverse dynamics on multi-task datasets and refining policies entirely within a frozen world model to avoid online real-world interactions. We further design a two-hot action encoding scheme tailored for robotic manipulation and adopt diffusion backbones to improve modeling fidelity. Extensive simulation and real-world experiments demonstrate that World4RL provides high-fidelity environment modeling and enables consistent policy refinement, yielding significantly higher success rates than imitation learning and other baselines.
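The abstract mentions a two-hot action encoding without giving its details. As an illustration only, the following is a minimal sketch of the generic two-hot idea: each continuous action dimension is spread over the two nearest bins of a fixed grid, so that the weighted bin average recovers the original value. The function names, the uniform bin grid, the `[-1, 1]` action range, and the bin count are all assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np


def two_hot_encode(action, low=-1.0, high=1.0, num_bins=11):
    """Encode each action dimension as weights over its two nearest bins.

    Hypothetical helper: uniform bins on [low, high] are an assumption,
    not the paper's stated design.
    Returns (encoding of shape (D, num_bins), bin centers of shape (num_bins,)).
    """
    action = np.clip(np.atleast_1d(np.asarray(action, dtype=np.float64)), low, high)
    bins = np.linspace(low, high, num_bins)

    # Fractional position of each action value on the bin grid.
    pos = (action - low) / (high - low) * (num_bins - 1)
    # Index of the left neighbor; capped so that left + 1 is always valid.
    left = np.clip(np.floor(pos).astype(int), 0, num_bins - 2)
    frac = pos - left  # weight assigned to the right neighbor

    encoding = np.zeros(action.shape + (num_bins,))
    np.put_along_axis(encoding, left[..., None], (1.0 - frac)[..., None], axis=-1)
    np.put_along_axis(encoding, (left + 1)[..., None], frac[..., None], axis=-1)
    return encoding, bins


def two_hot_decode(encoding, bins):
    """Recover the continuous action as the weighted average of bin centers."""
    return (encoding * bins).sum(axis=-1)


if __name__ == "__main__":
    actions = np.array([0.0, 0.25, -1.0, 1.0])
    enc, bins = two_hot_encode(actions)
    print(enc.shape)
    print(two_hot_decode(enc, bins))
```

In this formulation every row has at most two non-zero entries that sum to one, which keeps the encoding sparse while remaining exactly invertible for values inside the assumed range.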

Paper Structure

This paper contains 19 sections, 9 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the proposed World4RL framework. Stage 1 (Pre-training) trains the diffusion transition model on task-agnostic data, optimizes the reward classifier on task-specific success-annotated data, and initializes the policy through imitation learning with expert demonstrations. Stage 2 (Policy Optimization) freezes the pre-trained world model and employs reinforcement learning entirely within imagined rollouts.
  • Figure 2: Visualization of predicted rollouts on the Coffee-Pull-v2 task. The ground truth (GT) trajectory corresponds to a failed execution, where the robot does not successfully pull the cup. World4RL accurately models this failure trajectory, faithfully capturing the underlying dynamics, while baseline models (NWM (Bar et al., 2024), iVideoGPT (Wu et al., 2024), and DiWA (Chandra et al., 2025)) incorrectly generate successful executions.
  • Figure 3: Comparison of online sample efficiency. World4RL achieves comparable performance using only fixed datasets, whereas RLPD and Uni-O4 require more than 300k additional online steps.
  • Figure 4: Real-world tasks.
  • Figure 5: Ablations on policy optimization performance. We compare the full method with two variants: w/o action std clipping and w/o random rollouts.
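
The Stage 2 loop described in the Figure 1 caption, in which the policy interacts only with a frozen world model, can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the linear `frozen_world_model`, the threshold-based `reward_classifier`, and the toy Gaussian `policy` are hypothetical stand-ins, and "action std clipping" (ablated in Figure 5) is interpreted here as bounding the standard deviation of the policy's action distribution, which the captions do not confirm.

```python
import numpy as np

rng = np.random.default_rng(0)


def frozen_world_model(obs, action):
    """Hypothetical stand-in for the frozen diffusion transition model."""
    return obs + 0.1 * action


def reward_classifier(obs):
    """Hypothetical stand-in for the success classifier (1.0 near the origin)."""
    return float(np.linalg.norm(obs) < 0.05)


def policy(obs, log_std, std_min=0.05, std_max=0.5):
    """Toy Gaussian policy; the std bounds are assumed values for illustration."""
    mean = -obs  # toy behavior: move toward the origin
    std = np.clip(np.exp(log_std), std_min, std_max)  # assumed form of std clipping
    action = mean + std * rng.standard_normal(obs.shape)
    return action, std


def imagine_rollout(obs0, log_std, horizon=20):
    """Collect one imagined trajectory without touching a real environment."""
    obs, trajectory = obs0, []
    for _ in range(horizon):
        action, _ = policy(obs, log_std)
        next_obs = frozen_world_model(obs, action)  # world model is never updated
        reward = reward_classifier(next_obs)
        trajectory.append((obs, action, reward, next_obs))
        obs = next_obs
        if reward == 1.0:  # stop once the classifier signals success
            break
    return trajectory


if __name__ == "__main__":
    traj = imagine_rollout(np.array([0.5, -0.5]), log_std=np.zeros(2))
    print(len(traj), "imagined transitions")
```

The transitions collected this way would then be passed to whatever reinforcement learning update is used; that update step is omitted here because the captions do not specify it.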