World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy
Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, Mike Zheng Shou
TL;DR
World-VLA-Loop addresses unreliable action grounding in video-based robotic world models by introducing a state-aware, action-conditioned world simulator trained on a Success and Near-Success (SANS) dataset. The method couples world-model refinement with RL post-training of VLA policies in a closed loop: failure rollouts from the policy inform subsequent world-model updates, and a reward head integrated into the diffusion-based video model predicts rewards jointly with future observations so that the two stay aligned. Empirical results show high visual and reward alignment, with OpenVLA-OFT policies achieving notable gains on LIBERO benchmarks and real-world tasks after RL within the simulator, and further improvements through iterative data augmentation. The approach reduces reliance on costly real-world interaction and offers a practical pathway toward real-world robotics in which world models and policies co-evolve.
Abstract
Recent progress in robotic world models has leveraged video diffusion transformers to predict future observations conditioned on historical states and actions. While these models can simulate realistic visual outcomes, they often exhibit poor action-following precision, hindering their utility for downstream robotic learning. In this work, we introduce World-VLA-Loop, a closed-loop framework for the joint refinement of world models and Vision-Language-Action (VLA) policies. We propose a state-aware video world model that functions as a high-fidelity interactive simulator by jointly predicting future observations and reward signals. To enhance reliability, we introduce the SANS dataset, which incorporates near-success trajectories to improve action-outcome alignment within the world model. This framework enables reinforcement learning (RL) post-training of VLA policies in a closed loop, entirely within a virtual environment. Crucially, our approach facilitates a co-evolving cycle: failure rollouts generated by the VLA policy are iteratively fed back to refine the world model's precision, which in turn enhances subsequent RL optimization. Evaluations across simulation and real-world tasks demonstrate that our framework significantly boosts VLA performance with minimal physical interaction, establishing a mutually beneficial relationship between world modeling and policy learning for general-purpose robotics. Project page: https://showlab.github.io/World-VLA-Loop/.
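The co-evolving cycle described in the abstract can be sketched as a short control loop. This is an illustrative outline only, not the authors' implementation: the names `Rollout`, `WorldModel`, `closed_loop`, `post_train`, and `collect_rollouts` are hypothetical, the world model is reduced to a container that records its training data, and the RL step and rollout collection are passed in as caller-supplied functions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Rollout:
    """One policy trajectory with its outcome label (hypothetical structure)."""
    actions: List[float]
    success: bool


@dataclass
class WorldModel:
    """Stand-in for the action-conditioned video model with a reward head.

    A real model would be trained on video; this stub only records data.
    """
    train_set: List[Rollout] = field(default_factory=list)
    updates: int = 0

    def refine(self, rollouts: List[Rollout]) -> None:
        # Assumption: refinement means continuing training on the new data.
        self.train_set.extend(rollouts)
        self.updates += 1


def closed_loop(
    world_model: WorldModel,
    sans_data: List[Rollout],
    post_train: Callable[[WorldModel, Any], Any],
    collect_rollouts: Callable[[Any], List[Rollout]],
    rounds: int,
) -> Any:
    """Alternate RL post-training in the world model with world-model refinement."""
    # Initial world-model training on success and near-success trajectories.
    world_model.refine(sans_data)
    policy = None
    for _ in range(rounds):
        # RL post-training of the policy inside the learned simulator.
        policy = post_train(world_model, policy)
        # Failure rollouts from the updated policy are fed back to the world model.
        failures = [r for r in collect_rollouts(policy) if not r.success]
        if failures:
            world_model.refine(failures)
    return policy


# Toy usage: the "policy" is just a version counter.
def toy_post_train(wm: WorldModel, policy: Any) -> int:
    return (policy or 0) + 1


def toy_collect(policy: Any) -> List[Rollout]:
    return [Rollout([0.0], True), Rollout([1.0], False)]


wm = WorldModel()
sans = [Rollout([0.1], True), Rollout([0.2], False)]
final_policy = closed_loop(wm, sans, toy_post_train, toy_collect, rounds=2)
```

In this toy run, each round adds one failure rollout to the world model's training set. How rollouts are collected and how the world model is actually updated are left to the caller, since the abstract does not specify them.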
