VAMPO: Policy Optimization for Improving Visual Dynamics in Video Action Models

Zirui Ge, Pengxiang Ding, Baohua Yin, Qishen Wang, Zhiyong Xie, Yemin Wang, Jinbo Wang, Hengtao Li, Runze Suo, Wenxuan Song, Han Zhao, Shangke Lyu, Zhaoxin Fan, Haoang Li, Ran Cheng, Cheng Chi, Huibin Ge, Yaozhi Luo, Donglin Wang

Abstract

Video action models are an appealing foundation for Vision-Language-Action systems because they can learn visual dynamics from large-scale video data and transfer this knowledge to downstream robot control. Yet current diffusion-based video predictors are trained with likelihood-surrogate objectives, which encourage globally plausible predictions without explicitly optimizing the precision-critical visual dynamics needed for manipulation. This objective mismatch often leads to subtle errors in object pose, spatial relations, and contact timing that can be amplified by downstream policies. We propose VAMPO, a post-training framework that directly improves visual dynamics in video action models through policy optimization. Our key idea is to formulate multi-step denoising as a sequential decision process and optimize the denoising policy with rewards defined over expert visual dynamics in latent space. To make this optimization practical, we introduce an Euler Hybrid sampler that injects stochasticity only at the first denoising step, enabling tractable low-variance policy-gradient estimation while preserving the coherence of the remaining denoising trajectory. We further combine this design with GRPO and a verifiable non-adversarial reward. Across diverse simulated and real-world manipulation tasks, VAMPO improves task-relevant visual dynamics, leading to better downstream action generation and stronger generalization. The project homepage is https://vampo-robot.github.io/VAMPO/.

Paper Structure (19 sections, 19 equations, 8 figures, 9 tables, 1 algorithm)

Figures (8)

  • Figure 1: Overview of VAMPO. Our post-training framework replaces the surrogate objective in video action models with reinforcement learning from verified rewards, enabling direct optimization of task-specific goals when training the video prediction model (VPM). This improves the accuracy of the VPM's predictive visual representations, leading to better action generation and task performance. The method improves performance in both simulated environments and real-world scenarios.
  • Figure 2: Overview of the VAMPO training paradigm. In the pretraining stage, the video prediction model (VPM) and action generation model (AGM) are trained on expert demonstrations. In the policy optimization stage, the VPM generates future latents via a hybrid denoising process, using SDE-style stochasticity only at the first step and ODE-based denoising for the remaining steps. Verified rewards are computed by comparing predicted latents with expert latents, and GRPO is used to optimize the VPM toward more precise, control-relevant visual dynamics for downstream action generation.
  • Figure 3: Evaluation on Visual Dynamics. The figure reports the L1 distance between predicted latents and ground-truth latents over training steps. VAMPO exhibits improved alignment with expert dynamics, leading to hallucination suppression, planning correction, and action rectification.
  • Figure 4: Qualitative visualization across multiple benchmarks. The figure shows the improved action prediction and manipulation quality achieved by VAMPO on real-world platforms (Agibot Genie 01, Flexiv dual-arm robot, WidowX).
  • Figure 5: Real-world evaluation across multiple task benchmarks. The figure reports performance on three manipulation benchmarks (grasping in clutter, single-arm pick-and-place, and bimanual grasp-and-place) on the Agibot Genie 01 platform. VAMPO (Ours) achieves the best performance across all tasks.
  • ...and 3 more figures
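
The policy optimization stage described in the Figure 2 caption (noise at the first denoising step only, deterministic Euler steps afterwards, a reward from comparing predicted and expert latents, and group-relative advantages as in GRPO) can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: `velocity` is a hypothetical stand-in for the learned VPM, the noise scale `sigma * sqrt(dt)` and the negative mean L1 reward are assumed forms, and all function names are illustrative.

```python
import numpy as np

def velocity(x, t):
    """Hypothetical stand-in for the VPM's learned velocity field (toy dynamics)."""
    return -x

def hybrid_sample(x0, num_steps, sigma, rng):
    """Euler sampling with Gaussian noise injected only at the first step.

    Returns the final latent and the log-probability of the first (stochastic)
    step, which is the only quantity a policy-gradient estimator would need here.
    """
    dt = 1.0 / num_steps
    std = sigma * np.sqrt(dt)                    # assumed noise scale
    mean = x0 + velocity(x0, 0.0) * dt           # deterministic Euler mean
    noise = rng.standard_normal(x0.shape)
    x = mean + std * noise                       # stochastic (SDE-style) first step
    # Log-density of an isotropic Gaussian N(mean, std^2 I) evaluated at x.
    log_prob = -0.5 * np.sum(noise ** 2) - x.size * (np.log(std) + 0.5 * np.log(2 * np.pi))
    for k in range(1, num_steps):                # remaining steps are deterministic (ODE)
        x = x + velocity(x, k * dt) * dt
    return x, log_prob

def latent_reward(pred, expert):
    """Assumed reward: negative mean L1 distance to the expert latent."""
    return -np.mean(np.abs(pred - expert))

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples (GRPO-style)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Draw a group of 8 predictions from the same starting latent and score them.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)
expert = np.zeros(4)                             # hypothetical expert latent
samples = [hybrid_sample(x0, num_steps=5, sigma=0.5, rng=rng) for _ in range(8)]
rewards = [latent_reward(x, expert) for x, _ in samples]
advantages = group_relative_advantages(rewards)
```

In a full training loop, each advantage would weight the gradient of the corresponding first-step log-probability. Because only one step is stochastic, that log-probability is a single Gaussian term rather than a product over the whole denoising trajectory.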