VLAW: Iterative Co-Improvement of Vision-Language-Action Policy and World Model

Yanjiang Guo, Tony Lee, Lucy Xiaoyang Shi, Jianyu Chen, Percy Liang, Chelsea Finn

TL;DR

VLAW is an iterative, data-efficient framework that co-improves a vision-language-action (VLA) policy and an action-conditioned world model. It grounds the world model with a limited number of real-world rollouts, then uses the grounded model to generate large-scale synthetic trajectories; a reward model identifies the successful trajectories, which are used for stable supervised fine-tuning of the VLA policy. Real-robot experiments on the DROID platform across multiple contact-rich tasks show a 39.2% absolute success-rate improvement over the base policy and an 11.6% improvement from training with the generated synthetic rollouts. The approach offers a scalable path for improving generalist robotic policies with limited real data, and it connects to regularized RL through a weighted-projection perspective.

Abstract

The goal of this paper is to improve the performance and reliability of vision-language-action (VLA) models through iterative online interaction. Since collecting policy rollouts in the real world is expensive, we investigate whether a learned simulator (specifically, an action-conditioned video generation model) can be used to generate additional rollout data. Unfortunately, existing world models lack the physical fidelity necessary for policy improvement: they are predominantly trained on demonstration datasets that lack coverage of many different physical interactions (particularly failure cases), and they struggle to accurately model small yet critical physical details in contact-rich object manipulation. We propose a simple iterative improvement algorithm that uses real-world rollout data to improve the fidelity of the world model, which can then, in turn, be used to generate supplemental synthetic data for improving the VLA model. In our experiments on a real robot, we use this approach to improve the performance of a state-of-the-art VLA model on multiple downstream tasks. We achieve a 39.2% absolute success rate improvement over the base policy and an 11.6% improvement from training with the generated synthetic rollouts. Videos can be found at this anonymous website: https://sites.google.com/view/vla-w

Paper Structure

This paper contains 20 sections, 12 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: VLA model rollouts in the real world are time-consuming and do not scale. In VLAW, we first learn an action-conditioned world model using limited real-world online rollouts, which in turn generates large-scale synthetic data in imagination.
  • Figure 2: Policy online rollout data can help ground the pretrained world model in downstream tasks. Once the world model is grounded, we can generate massive data for policy learning.
  • Figure 3: Detailed pipeline for VLAW: (1) We first roll out the policy in the real world to collect a small set of online trajectories. (2) We then fine-tune a pretrained action-conditioned world model on these policy rollout data, grounding the world model in the target tasks and improving its predictive fidelity. (3) Using the resulting world model, we generate large-scale synthetic trajectories through closed-loop interactions between the policy and the world model. (4) Finally, we optimize the VLA policy using both real-world and synthetic data, with reward automatically assessed by a vision–language reward model.
  • Figure 4: Our experiments are conducted on the DROID platform and cover five task categories, as illustrated in the figure. These tasks involve complex physical interactions, including frequent contact and deformable objects, which are challenging to model in traditional simulations.
  • Figure 5: Examples of long-horizon policy-in-the-loop rollouts within the world model starting from the initial observation. The policy $\pi_{0.5}$ is rolled out for 20 iterations (20 seconds). The post-trained world model accurately captures contact-rich physical dynamics. Top: scooping peanuts into a new bowl. Bottom: erasing marker drawings with a tissue.
  • ...and 4 more figures
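
The four stages described in the Figure 3 caption can be sketched as a single loop iteration. The code below is a minimal toy illustration under stated assumptions, not the authors' implementation: `ToyPolicy`, `ToyWorldModel`, `real_step`, `reward_model`, and `vlaw_iteration` are hypothetical names, the "robot" is a one-dimensional counter, the world model is a lookup table instead of a video generation model, and the reward model is a threshold check instead of a vision-language model. It only shows how the data flows: real rollouts ground the world model, the grounded model produces many synthetic rollouts, and the policy is fine-tuned on the trajectories the reward model labels as successful.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    observations: list  # o_0 ... o_T
    actions: list       # a_0 ... a_{T-1}
    synthetic: bool     # True if generated inside the world model


class ToyPolicy:
    """Hypothetical stand-in for a VLA policy: picks the useful action 1 with prob p_good."""

    def __init__(self, p_good=0.3):
        self.p_good = p_good

    def act(self, obs, rng):
        return 1 if rng.random() < self.p_good else 0

    def finetune(self, trajectories):
        # Supervised fine-tuning: the maximum-likelihood Bernoulli parameter
        # is the mean action in the (filtered) training data.
        actions = [a for t in trajectories for a in t.actions]
        if actions:
            self.p_good = sum(actions) / len(actions)


class ToyWorldModel:
    """Hypothetical stand-in for an action-conditioned world model."""

    def __init__(self):
        self.delta = {}  # action -> learned change in observation

    def finetune(self, trajectories):
        # "Grounding": memorize the observed effect of each action.
        for t in trajectories:
            for obs, act, nxt in zip(t.observations, t.actions, t.observations[1:]):
                self.delta[act] = nxt - obs

    def step(self, obs, action):
        # Actions never seen in real data are predicted to do nothing.
        return obs + self.delta.get(action, 0)


def real_step(obs, action):
    """Toy 'real world': action 1 advances the task by one unit."""
    return obs + action


def reward_model(traj, goal=5):
    """Toy success detector: did the final observation reach the goal?"""
    return traj.observations[-1] >= goal


def rollout(policy, step_fn, rng, horizon=10, synthetic=False):
    obs = 0
    observations, actions = [obs], []
    for _ in range(horizon):
        action = policy.act(obs, rng)
        obs = step_fn(obs, action)
        actions.append(action)
        observations.append(obs)
    return Trajectory(observations, actions, synthetic)


def vlaw_iteration(policy, world_model, rng, n_real=20, n_synth=200):
    # (1) Collect a small set of real rollouts with the current policy.
    real = [rollout(policy, real_step, rng) for _ in range(n_real)]
    # (2) Ground the world model on those rollouts.
    world_model.finetune(real)
    # (3) Generate many synthetic rollouts, policy in the loop with the world model.
    synth = [rollout(policy, world_model.step, rng, synthetic=True) for _ in range(n_synth)]
    # (4) Keep trajectories the reward model marks as successful and fine-tune on them.
    successes = [t for t in real + synth if reward_model(t)]
    policy.finetune(successes)
    return real, synth, successes


if __name__ == "__main__":
    rng = random.Random(0)
    policy, world_model = ToyPolicy(), ToyWorldModel()
    for i in range(3):
        real, synth, successes = vlaw_iteration(policy, world_model, rng)
        print(f"iteration {i}: {len(successes)} successes, p_good={policy.p_good:.2f}")
```

Calling `vlaw_iteration` repeatedly mirrors the iterative co-improvement idea: each round uses the updated policy to gather new real rollouts, which in turn refresh the world model. Filtering on success before fine-tuning is one simple reading of "reward automatically assessed by a reward model"; the paper's actual weighting scheme may differ.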