World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, Mike Zheng Shou

TL;DR

World-VLA-Loop tackles the challenge of unreliable action-grounding in video-based robotic world models by introducing a state-aware, action-conditioned world simulator trained with a Success and Near-Success Dataset (SANS). The method couples world-model refinement with RL post-training of VLA policies in a closed loop, where failure rollouts inform subsequent world-model updates, and integrates a reward head into the diffusion-based video model for aligned outcomes. Empirical results show high visual and reward alignment, with OpenVLA-OFT policies achieving notable gains on LIBERO benchmarks and real-world tasks after RL within the simulator, and further improvements through iterative data augmentation. This approach reduces reliance on costly real-world interactions and demonstrates a practical, generalizable pathway for real-world robotics through co-evolving world models and policy learning.

Abstract

Recent progress in robotic world models has leveraged video diffusion transformers to predict future observations conditioned on historical states and actions. While these models can simulate realistic visual outcomes, they often exhibit poor action-following precision, hindering their utility for downstream robotic learning. In this work, we introduce World-VLA-Loop, a closed-loop framework for the joint refinement of world models and Vision-Language-Action (VLA) policies. We propose a state-aware video world model that functions as a high-fidelity interactive simulator by jointly predicting future observations and reward signals. To enhance reliability, we introduce the SANS dataset, which incorporates near-success trajectories to improve action-outcome alignment within the world model. This framework enables a closed loop for reinforcement learning (RL) post-training of VLA policies entirely within a virtual environment. Crucially, our approach facilitates a co-evolving cycle: failure rollouts generated by the VLA policy are iteratively fed back to refine the world model's precision, which in turn enhances subsequent RL optimization. Evaluations across simulation and real-world tasks demonstrate that our framework significantly boosts VLA performance with minimal physical interaction, establishing a mutually beneficial relationship between world modeling and policy learning for general-purpose robotics. Project page: https://showlab.github.io/World-VLA-Loop/.

Paper Structure (20 sections, 1 equation, 8 figures, 5 tables)

Figures (8)

  • Figure 1: (a) Paradigms for world-model-based VLA reinforcement learning. Comparison of existing methodologies: current approaches typically rely on reconstructing the environment within a 3D world or on training video world models that simulate the environment. To address the imprecise action-following inherent in existing video-based simulators, we propose World-VLA-Loop, a closed-loop paradigm that jointly optimizes the world model and the VLA policy to iteratively enhance the performance and grounding of both. (b) We show that the real-world policy success rate is improved by 36.7% after two iterations of joint optimization of the VLA model and the world model.
  • Figure 2: Current world models struggle to accurately simulate failure cases stemming from minor action errors. This is primarily due to their inability to model fine-grained interaction dynamics and precise action conditioning. In the figure, the transparent overlays denote the ground-truth gripper trajectories, illustrating cases where the robot fails to grasp the object.
  • Figure 3: Full pipeline of our proposed framework. The process comprises four phases: (1) Curating the SANS dataset via manual teleoperation and policy rollouts; (2) Pretraining the action-conditioned world model on SANS with joint reward and video supervision; (3) Executing VLA policy rollouts within the world model to perform GRPO optimization; and (4) Deploying the refined policy to collect new failure and success data for further SANS augmentation. This cycle enables the joint optimization of the world model and the VLA policy, iteratively enhancing the performance of both.
  • Figure 4: Success rate improvements along World-VLA-Loop RL training steps.
  • Figure 5: Examples of world-model-generated rollouts and actual execution videos for both the SFT policy and the RL post-trained policy.
  • ...and 3 more figures
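
Phase (3) in the Figure 3 caption uses GRPO on rollouts generated inside the world model. As a rough illustration only, the sketch below shows the standard group-relative advantage that GRPO is generally described with: each rollout's reward is normalized by the mean and standard deviation of its own group. The function name, the `eps` constant, and the binary example rewards are assumptions made for this sketch; the listing does not state the exact variant or reward scale the authors use.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own rollout group (GRPO-style).

    `rewards` holds one scalar per rollout sampled for the same task prompt.
    `eps` is an assumed small constant that avoids division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts scored by a reward head (1 = success).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

In this toy group, successful rollouts receive positive advantages and failed ones negative advantages, while a group whose rollouts all share the same reward yields zero advantage for every member.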