WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL

Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, Yu Wang, Haoran Li, Chao Yu, Dongbin Zhao

TL;DR

WoVR is a reliable world-model-based reinforcement learning framework for post-training VLA policies. It enables stable long-horizon imagined rollouts and effective policy optimization, and shows that learned world models can serve as practical simulators for reinforcement learning when hallucination is explicitly controlled.

Abstract

Reinforcement learning (RL) promises to unlock capabilities beyond imitation learning for Vision-Language-Action (VLA) models, but its requirement for massive real-world interaction prevents direct deployment on physical robots. Recent work attempts to use learned world models as simulators for policy optimization, yet closed-loop imagined rollouts inevitably suffer from hallucination and long-horizon error accumulation. Such errors do not merely degrade visual fidelity; they corrupt the optimization signal, encouraging policies to exploit model inaccuracies rather than genuine task progress. We propose WoVR, a reliable world-model-based reinforcement learning framework for post-training VLA policies. Instead of assuming a faithful world model, WoVR explicitly regulates how RL interacts with imperfect imagined dynamics. It improves rollout stability through a controllable action-conditioned video world model, reshapes imagined interaction to reduce effective error depth via Keyframe-Initialized Rollouts, and maintains policy-simulator alignment through World Model-Policy co-evolution. Extensive experiments on LIBERO benchmarks and real-world robotic manipulation demonstrate that WoVR enables stable long-horizon imagined rollouts and effective policy optimization, improving average LIBERO success from 39.95% to 69.2% (+29.3 points) and real-robot success from 61.7% to 91.7% (+30.0 points). These results show that learned world models can serve as practical simulators for reinforcement learning when hallucination is explicitly controlled.
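
The idea of reducing "effective error depth" can be illustrated with a toy example. The sketch below is not the WoVR implementation: it replaces the video world model with a hypothetical scalar model that adds a constant bias at every imagined step, and the names `true_step`, `imagined_step`, and `final_error` are made up for illustration. Under these assumptions, starting the imagined rollout from a later, real keyframe leaves fewer imagined steps over which the error can accumulate.

```python
def true_step(state, action):
    # Ground-truth toy dynamics: the state simply integrates the action.
    return state + action


def imagined_step(state, action, bias=0.01):
    # Hypothetical imperfect world model: same dynamics plus a constant bias.
    # Real model errors need not be constant or additive; this is an assumption.
    return state + action + bias


def final_error(actions, start_index, bias=0.01):
    """Follow the true dynamics up to start_index, then imagine the remainder.

    Returns the absolute gap between imagined and true final states.
    """
    state = 0.0
    for a in actions[:start_index]:
        state = true_step(state, a)

    imagined, truth = state, state
    for a in actions[start_index:]:
        imagined = imagined_step(imagined, a, bias)
        truth = true_step(truth, a)
    return abs(imagined - truth)


actions = [0.1] * 100
full_rollout_error = final_error(actions, start_index=0)       # 100 imagined steps
keyframe_rollout_error = final_error(actions, start_index=80)  # 20 imagined steps
print(full_rollout_error, keyframe_rollout_error)
```

In this toy setting the final error grows with the number of imagined steps, so the keyframe-initialized rollout ends with roughly one fifth of the error of the full rollout. A learned video model may compound errors nonlinearly, so the numbers here are only indicative.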

Paper Structure (34 sections, 11 equations, 8 figures, 5 tables)

This paper contains 34 sections, 11 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Hallucination in Closed-Loop World Model Rollouts. The world model imagines a successful grasp (green frames), but real-world execution fails (red frames). To address this critical mismatch, we propose three hallucination-aware mechanisms.
  • Figure 2: Overview of WoVR. WoVR builds a reliability-driven reinforcement learning framework entirely around the learned world model. It first strengthens the world model as a controllable simulator, ensuring rollout-stable and action-responsive generation. On top of this simulator, it designs a reliable interaction protocol via Keyframe-Initialized Rollouts (KIR) and masked GRPO to reduce effective error depth and prevent optimization on hallucinated success. Finally, it maintains policy–model alignment through PACE, which co-evolves the world model with the evolving policy to mitigate distribution shift and preserve simulator reliability.
  • Figure 3: Architecture of the proposed action-conditioned world model. The world model is built upon a video diffusion backbone and conditioned on actions via a dual-channel action injection design, enabling frame-level controllability and stable chunk-by-chunk autoregressive generation for long-horizon imagined rollouts.
  • Figure 4: Visualization of the self-attention probability map. During the denoising process, many attention heads focus on the first frame of the sequence.
  • Figure 5: Illustration of the effect of Keyframe-Initialized Rollouts (KIR). Starting from the initial state, long-horizon rollouts accumulate prediction errors in early stages, leading to hallucinated success that contradicts the ground-truth failure. In contrast, Keyframe-Initialized Rollouts initialize rollouts near critical states, enabling physically consistent predictions that correctly model failure, which in turn facilitates more efficient and stable policy learning.
  • ...and 3 more figures
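
The Figure 2 caption mentions masked GRPO without giving details here. As a rough sketch only, the code below shows a group-relative advantage computation (reward minus group mean, divided by group standard deviation) combined with a per-rollout validity mask. How WoVR actually defines its mask is not stated in this summary, so the `valid` flags, the function name `masked_group_advantages`, and the choice of population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def masked_group_advantages(rewards, valid, eps=1e-8):
    """Group-relative advantages with a hypothetical per-rollout mask.

    rewards: one scalar reward per rollout in the group.
    valid:   booleans; False marks a rollout excluded from optimization
             (for example, one suspected of hallucinated success).
    Masked rollouts get zero advantage and do not affect group statistics.
    """
    kept = [r for r, v in zip(rewards, valid) if v]
    if len(kept) < 2:
        # Too few valid rollouts to form a meaningful group baseline.
        return [0.0] * len(rewards)

    mu = mean(kept)
    sigma = pstdev(kept)  # population std; a sample std is another option
    return [
        (r - mu) / (sigma + eps) if v else 0.0
        for r, v in zip(rewards, valid)
    ]


# Four imagined rollouts; the third is masked out.
adv = masked_group_advantages(
    rewards=[1.0, 0.0, 1.0, 0.0],
    valid=[True, True, False, True],
)
print(adv)
```

With this masking, the excluded rollout contributes no gradient signal, and the remaining advantages are centered on the mean of the valid rollouts only.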