WMPO: World Model-based Policy Optimization for Vision-Language-Action Models

Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, Song Guo

TL;DR

WMPO targets two weaknesses of Vision-Language-Action (VLA) policies, brittleness and heavy dependence on demonstration data, by running on-policy reinforcement learning inside a high-fidelity, action-conditioned, pixel-space world model. It aligns the world model with the policy's own behavior, uses clip-based GRPO for stable on-policy updates, and relies on a lightweight reward model for sparse feedback, so training runs entirely in imagination with no real-robot interaction. The approach improves sample efficiency, generalizes robustly under distribution shift, and exhibits emergent self-correction in both simulation and real-robot experiments, including lifelong learning scenarios. By keeping imagined rollouts in pixel space, the framework connects pretrained VLA representations to realistic visual dynamics and offers a scalable path to manipulation policies that can recover from failures.

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, but their reliance on expert demonstrations limits their ability to learn from failures and perform self-corrections. Reinforcement learning (RL) addresses these limitations through self-improving interactions with the physical environment, but suffers from high sample complexity on real robots. We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL without interacting with the real environment. In contrast to widely used latent world models, WMPO focuses on pixel-based predictions that align the "imagined" trajectories with the VLA features pretrained on web-scale images. Crucially, WMPO enables the policy to perform on-policy GRPO, which provides stronger performance than the often-used off-policy methods. Extensive experiments in both simulation and real-robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.
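
To make the "on-policy GRPO" step concrete, the following is a minimal sketch of a generic GRPO-style update signal: rewards from a group of trajectories are normalized within the group, then fed into a PPO-style clipped surrogate. This is not taken from the paper. The function names, the standard-deviation normalization, the clip range of 0.2, and the toy log-probabilities are all assumptions for illustration, and the paper's exact objective may differ.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory's reward against its own sampling group.

    Assumption: mean/std normalization, as in common GRPO formulations.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective averaged over the group (to be maximized).

    logp_new / logp_old are per-trajectory log-probabilities under the
    current and the sampling policy; clip_eps is a hypothetical default.
    """
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_theta / pi_theta_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        terms.append(min(ratio * adv, clipped * adv))
    return mean(terms)


# Toy group of four imagined trajectories with sparse success rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
logp_old = [-1.0, -1.0, -1.0, -1.0]
logp_new = [-1.0, -1.2, -0.9, -1.1]
objective = clipped_surrogate(logp_new, logp_old, advantages)
```

Because the advantage is computed relative to the group mean, successful trajectories are pushed up and failed ones pushed down without a learned value function. That suits the sparse, trajectory-level reward the TL;DR describes.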

Paper Structure

This paper contains 30 sections, 5 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Three different VLA training paradigms: (a) Imitation learning learns from human demonstrations but cannot learn from failures or self-correct; (b) Real-world RL improves the policy through direct interaction but suffers from high sampling costs and difficulty in achieving on-policy RL; (c) WMPO pretrains a world model on large-scale robotic trajectories and fine-tunes it with limited policy behavior data, enabling sample-efficient on-policy RL for VLA without real-world interaction.
  • Figure 2: WMPO starts from an initial state $s_0$. The overall training procedure consists of three components: (1) Imagined Trajectory Generation, where policy model $\pi_{\theta_{\text{old}}}$ and world model $p_\phi$ interact alternately to generate a full imagined trajectory; (2) Trajectory Sampling, where multiple trajectories are sampled and evaluated by the reward model $R_\psi$; and (3) Policy Update, where the policy parameters $\theta$ are optimized with the paper's GRPO objective (Eq. \ref{eq:grpo}). This process is iteratively repeated throughout training.
  • Figure 3: Behavior analysis of the Square task (inserting the square into the stick) shows that, compared with the base policy, WMPO demonstrates the ability to self-correct.
  • Figure 4: (a) For the Square task, we vary the stick’s position from fixed to a random position inside a rectangle. (b) For the StackThree task, we substitute the tabletop background with a gray background. (c) For the ThreePieceAssembly task, we substitute the red base with a dark wooden base.
  • Figure 5: Relative average trajectory length of successful trials across different policies (Base Policy = 100%).
  • ...and 4 more figures
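
The first two stages in the Figure 2 caption (imagined trajectory generation and trajectory sampling) can be sketched as a short loop. This is a toy illustration under stated assumptions, not the authors' implementation. `policy`, `world_model`, and `reward_model` are hypothetical stand-ins for $\pi_{\theta_{\text{old}}}$, $p_\phi$, and $R_\psi$. States are scalars rather than image frames, and the group size, horizon, and goal are made up. The policy update itself is omitted.

```python
import random


def policy(state, rng):
    """Hypothetical stand-in for pi_theta_old: samples a 1-D action."""
    return rng.uniform(-1.0, 1.0)


def world_model(state, action):
    """Hypothetical stand-in for p_phi: predicts the next state.

    In WMPO the prediction is a future image frame; a scalar is used here.
    """
    return state + action


def reward_model(states, goal=2.0, tol=0.5):
    """Hypothetical stand-in for R_psi: sparse success signal per trajectory."""
    return 1.0 if abs(states[-1] - goal) <= tol else 0.0


def imagine_trajectory(s0, horizon, rng):
    """Stage 1: policy and world model alternate; no real environment is used."""
    states, actions = [s0], []
    for _ in range(horizon):
        action = policy(states[-1], rng)
        actions.append(action)
        states.append(world_model(states[-1], action))
    return states, actions


def sample_group(s0, group_size=8, horizon=5, seed=0):
    """Stage 2: sample several imagined trajectories from s0 and score each."""
    rng = random.Random(seed)
    group = []
    for _ in range(group_size):
        states, actions = imagine_trajectory(s0, horizon, rng)
        group.append(
            {"states": states, "actions": actions, "reward": reward_model(states)}
        )
    return group


group = sample_group(s0=0.0)
```

Each entry in `group` holds one full imagined trajectory and its sparse reward. A policy update (stage 3) would consume these rewards as a group before the loop repeats with the refreshed policy.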