
World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, Qing Zhang

TL;DR

Vision-Language-Action models suffer from data scarcity when trained by imitation, and RL post-training in the real world is often infeasible due to non-resettable state transitions. World-Env replaces physical interaction with a video-based world simulator and a VLM-guided instant reflector, enabling safe, data-efficient post-training with continuous rewards and real-time termination signals. The approach augments demonstrations with self-exploration data, trains the world model offline, and learns the policy with a LOOP PPO framework that uses a Laplace action distribution to represent action uncertainty. Experiments on LIBERO demonstrate strong data efficiency, achieving high task success with as few as five demonstrations and providing practical benefits for safe, scalable post-training in resource-constrained domains.
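
To make the two policy-learning ingredients named above more concrete, here is a minimal Python sketch of a Laplace log-likelihood for a continuous action and of a leave-one-out baseline over a group of rollouts. It assumes that "LOOP" denotes a leave-one-out baseline combined with PPO, which this summary does not spell out; the function names `laplace_log_prob` and `leave_one_out_advantages` are illustrative and not taken from the World-Env code.

```python
import math


def laplace_log_prob(action, loc, scale):
    """Log-density of an action under independent per-dimension Laplace distributions.

    Each dimension contributes -log(2b) - |a - mu| / b, where mu is the
    location and b > 0 is the scale (a larger b means more uncertainty).
    """
    return sum(
        -math.log(2.0 * b) - abs(a - mu) / b
        for a, mu, b in zip(action, loc, scale)
    )


def leave_one_out_advantages(rewards):
    """Advantage of each rollout: its reward minus the mean reward of the others.

    Assumption: this is the baseline meant by "LOOP"; the summary does not define it.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two rollouts for a leave-one-out baseline")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

In a PPO-style update, the log-probability would enter the clipped probability ratio and the advantage would weight it; both are standard PPO components rather than details confirmed by this summary.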

Abstract

Vision-Language-Action (VLA) models trained via imitation learning suffer from significant performance degradation in data-scarce scenarios due to their reliance on large-scale demonstration datasets. Although reinforcement learning (RL)-based post-training has proven effective in addressing data scarcity, its application to VLA models is hindered by the non-resettable nature of real-world environments. This limitation is particularly critical in high-risk domains such as industrial automation, where interactions often induce state changes that are costly or infeasible to revert. Furthermore, existing VLA approaches lack a reliable mechanism for detecting task completion, leading to redundant actions that reduce overall task success rates. To address these challenges, we propose World-Env, an RL-based post-training framework that replaces physical interaction with a low-cost, world model-based virtual simulator. World-Env consists of two key components: (1) a video-based world simulator that generates temporally consistent future visual observations, and (2) a vision-language model (VLM)-guided instant reflector that provides continuous reward signals and predicts action termination. This simulated environment enables VLA models to safely explore and generalize beyond their initial imitation learning distribution. Our method achieves notable performance gains with as few as five expert demonstrations per task. Experiments on complex robotic manipulation tasks demonstrate that World-Env effectively overcomes the data inefficiency, safety constraints, and inefficient execution of conventional VLA models that rely on real-world interaction, offering a practical and scalable solution for post-training in resource-constrained settings. Our code is available at https://github.com/amap-cvlab/world-env.

Paper Structure

This paper contains 31 sections, 11 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparison of three VLA training paradigms: (a) Imitation learning suffers from poor generalization under data scarcity. (b) Prior RL-based post-training methods require real-world interaction, which is often infeasible due to non-resettable state transitions (e.g., object drop or collision). (c) Our proposed World-Env enables post-training via simulated rollouts using a world model, eliminating the need for physical interaction and supporting safe, efficient exploration even with minimal expert demonstrations.
  • Figure 2: Overview of World-Env. Our framework comprises: (1) a Training Data Strategy that augments human demonstration trajectories with VLA self-explored trajectories to train the World-Env Simulator; (2) an Optimization Loop in which the VLA model generates actions, the simulator predicts future observations, and the World-Env Reflector generates feedback; and (3) a Reward & End Signal that provides trajectory-wise reward and end signals for RL optimization.
  • Figure 3: Comparison between our method and SFT on multi-goal tasks. Note that results are collected every 5 training steps for three distinct goals.
  • Figure 4: Rendering comparison of the world simulator trained with and without extra data.
  • Figure 5: Real-world rendering results of the world simulator. We show a video sequence generated by our world simulator in a real-world scene.
  • ...and 7 more figures
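
The optimization loop described in the Figure 2 caption can be sketched as a rollout in which no physical environment is touched. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_policy`, `toy_simulator`, and `toy_reflector` are hypothetical stand-ins for the VLA model, the video world simulator, and the VLM-guided reflector, and observations are reduced to a single float so the example runs on its own. Taking the reflector's last score as the trajectory-wise reward is likewise an assumption.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Rollout:
    actions: list = field(default_factory=list)
    reward: float = 0.0       # trajectory-wise reward (assumed: last reflector score)
    terminated: bool = False  # True if the reflector signalled task completion


def toy_policy(obs, instruction, rng):
    # Hypothetical stand-in for the VLA model: a noisy step toward the goal.
    return 0.2 + rng.uniform(-0.05, 0.05)


def toy_simulator(obs, action):
    # Hypothetical stand-in for the world simulator: predicts the next observation.
    return obs + action


def toy_reflector(obs, instruction):
    # Hypothetical stand-in for the reflector: continuous score in [0, 1]
    # plus an end signal once the goal (obs >= 1.0) is reached.
    score = max(0.0, min(1.0, obs))
    return score, score >= 1.0


def simulated_rollout(policy, simulator, reflector, obs, instruction,
                      max_steps=50, seed=0):
    """Roll the policy out inside the simulator until the reflector says stop."""
    rng = random.Random(seed)
    rollout = Rollout()
    for _ in range(max_steps):
        action = policy(obs, instruction, rng)      # policy proposes an action
        obs = simulator(obs, action)                # simulator predicts the outcome
        score, done = reflector(obs, instruction)   # reflector scores the outcome
        rollout.actions.append(action)
        rollout.reward = score
        if done:                                    # stop early to avoid redundant actions
            rollout.terminated = True
            break
    return rollout
```

Calling `simulated_rollout(toy_policy, toy_simulator, toy_reflector, 0.0, "reach the goal")` returns a `Rollout` whose `terminated` flag shows whether the end signal fired before `max_steps`; several such rollouts per task would then supply the rewards used for RL optimization.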