WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, Song Guo
TL;DR
WMPO addresses the brittleness and data hunger of Vision-Language-Action (VLA) policies by grounding on-policy reinforcement learning in a high-fidelity, action-conditioned pixel-space world model. It aligns the world model with the policy's behavior, employs clip-based GRPO for stable on-policy updates, and uses a lightweight reward model to provide sparse feedback, so training can run entirely in imagination without real-robot interaction. The approach yields improved sample efficiency, robust generalization under distribution shift, and emergent self-correction behaviors, demonstrated in simulation and real-robot experiments, including lifelong learning scenarios. The framework bridges pretrained VLA representations with realistic visual dynamics, offering a scalable path toward robust robotic manipulation that can learn from failures.
Abstract
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, but their reliance on expert demonstrations limits their ability to learn from failures and perform self-corrections. Reinforcement learning (RL) addresses these limitations through self-improving interactions with the physical environment, but suffers from high sample complexity on real robots. We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL that requires no interaction with the real environment. In contrast to widely used latent world models, WMPO focuses on pixel-based predictions that align the "imagined" trajectories with the VLA features pretrained on web-scale images. Crucially, WMPO enables the policy to perform on-policy GRPO, which provides stronger performance than commonly used off-policy methods. Extensive experiments in both simulation and real-robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.
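To make the on-policy GRPO step more concrete, the sketch below shows the group-relative advantage and clipped surrogate as they are commonly formulated for GRPO in general. It is an illustrative assumption, not WMPO's actual objective, which the abstract does not specify. The function names, the `clip_eps` value, and the binary success rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory's reward against its own sampling group.

    Assumption: `rewards` holds one sparse reward per imagined trajectory,
    all sampled from the same initial state (one GRPO group).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate term for one sample.

    `ratio` is pi_new(a|s) / pi_old(a|s); `clip_eps` is a hypothetical value.
    """
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical group of four imagined rollouts: two succeed, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In this toy group, successful rollouts receive a positive advantage and failed ones a negative advantage, so no learned value function is needed. A group in which every rollout gets the same reward yields zero advantages and therefore no gradient signal.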
