Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

Zhongru Zhang, Chenghan Yang, Qingzhou Lu, Yanjiang Guo, Jianke Zhang, Yucheng Hu, Jianyu Chen

Abstract

Video generation models have advanced rapidly and are beginning to show a strong understanding of physical dynamics. In this paper, we investigate how far an advanced video generation model such as Veo-3 can support generalizable robotic manipulation. We first study a zero-shot approach in which Veo-3 predicts future image sequences from current robot observations, while an inverse dynamics model (IDM) recovers the corresponding robot actions. The IDM is trained solely on random-play data, requiring neither human supervision nor expert demonstrations. The key intuition is that, if a video model can generate physically plausible future motions in image space, an IDM can translate those visual trajectories into executable robot actions. We evaluate this "Veo-3+IDM" approach in both simulation and the real world using a high-dimensional dexterous hand. We find that, owing to the strong generalization capability of frontier video models, Veo-3+IDM can consistently generate approximately correct task-level trajectories. However, its low-level control accuracy remains insufficient to solve most tasks reliably. Motivated by this observation, we develop a hierarchical framework, Veo-Act, which uses Veo-3 as a high-level motion planner and a vision-language-action (VLA) policy as the low-level executor, significantly improving the instruction-following performance of a state-of-the-art VLA policy. Overall, our results suggest that, as video generation models continue to improve, they can become a valuable component for generalizable robot learning.
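
As a concrete illustration of the zero-shot "Veo-3+IDM" loop described above, here is a minimal sketch in which the video model proposes future frames and the IDM recovers an action from each consecutive frame pair. The `video_model`, `idm`, and `robot` interfaces are hypothetical placeholders, not the authors' actual API; the real system may chunk actions, re-plan, or post-process differently.

```python
# Minimal sketch of the zero-shot "Veo-3 + IDM" loop.
# video_model, idm, and robot are hypothetical interfaces, not the authors' API.

def run_veo3_plus_idm(video_model, idm, robot, instruction, replans=1):
    """Generate a future visual trajectory, recover actions with the IDM, execute."""
    obs = robot.get_observation()                      # current image I_0
    for _ in range(replans):
        frames = video_model.generate(image=obs, prompt=instruction)  # I*_0 ... I*_n
        # The IDM maps each consecutive generated frame pair (I*_t, I*_{t+1})
        # to an executable robot action a*_t.
        for prev_frame, next_frame in zip(frames[:-1], frames[1:]):
            action = idm.predict_action(prev_frame, next_frame)
            robot.step(action)
        obs = robot.get_observation()                  # optionally re-plan from the new state
    return obs
```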

Paper Structure

This paper contains 41 sections, 18 equations, 12 figures, 5 tables, 3 algorithms.

Figures (12)

  • Figure 1: Comparison of three control pipelines. (a) A VLA is adapted from a VLM by introducing a new action modality, but this adaptation sacrifices some degree of generalization. (b) “Video model + IDM” generalizes well but lacks accuracy in low-level control. (c) Our Veo-Act is a hierarchical pipeline that automatically switches between the video planner and the VLA, combining the strengths of both approaches.
  • Figure 2: Three paradigms of inference. The top row shows the generated video trajectory, where the last frame indicates task success. The second row shows trajectories executed by pure IDM inference. The third row shows trajectories executed by a Veo-Act variant that locks into the low-level policy after the first switch. The fourth row shows trajectories executed by the full Veo-Act setup. Here, 0 denotes the instruction-following stage and 1 denotes the interaction stage.
  • Figure 3: Overview of the hierarchical planning and control pipeline (see the control-loop sketch after this figure list). Starting from the first observation $I_0$ and a language prompt, a video model generates a future visual trajectory $I^{*}_{0:n}$. A multi-head inverse dynamics model converts this trajectory into a planned action chunk $a^{*}_{0:n-1}$ and a predicted gate sequence, then a smoother produces $\bar{a}^{*}_{0:n-1}$. During execution, the controller pops actions from the queue to follow the plan, while the IDM interaction detection head evaluates a real-time gate value $G_t$ from current observations. If $G_t$ exceeds a threshold, control switches from the instruction-following stage to a reactive low-level policy for dexterous interaction; otherwise it continues consuming the planned action queue. The system can switch back and resume the remaining planned actions to complete the task.
  • Figure 4: Multi-head IDM training pipeline (see the loss sketch after this figure list). We collect frame-pair samples in simulation and on the real robot, where each sample includes consecutive observations $(I_{t-1}, I_t)$, the executed action $a_t$, and a binary interaction label $g_t$ (grasp=1, non-grasp=0). We apply observation-level augmentation (STEM-OB) to the image sequence to improve robustness and reduce the sim-to-real gap, and feed the augmented frame pairs with state $s_t$ into a DINOv3-based encoder. The multi-head IDM predicts an action $\hat{a}_t$ and a gate value $\hat{G}_t \in [0,1]$ using two separate MLP heads. The model is trained end-to-end with a Huber loss for action regression and a binary cross-entropy loss for interaction detection, jointly optimizing $\mathcal{L}=\lambda_{\mathrm{act}}\mathcal{L}_{\mathrm{act}}(a_t,\hat{a}_t)+\lambda_{\mathrm{gate}}\mathcal{L}_{\mathrm{gate}}(g_t,\hat{G}_t)$.
  • Figure 5: Simulation success rates. Instruction-following success is shown in yellow and overall success in green. C-b: baseline under the Control setting; E-b: baseline under the Experimental setting; C-v: Veo-Act under the Control setting; E-v: Veo-Act under the Experimental setting.
  • ...and 7 more figures
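
The gated switching logic in Figure 3 can be summarized as a short control loop. The sketch below is an illustration under assumptions: `idm`, `vla_policy`, `robot`, and the gate threshold are hypothetical placeholders, and the way the real controller resumes the planned queue may differ.

```python
# Minimal sketch of the Veo-Act execution loop described in Figure 3.
# idm, vla_policy, robot, and gate_threshold are hypothetical placeholders.
from collections import deque

def execute_plan(planned_actions, idm, vla_policy, robot, gate_threshold=0.5):
    queue = deque(planned_actions)               # smoothed planned action chunk
    prev_obs = robot.get_observation()
    while queue:
        obs = robot.get_observation()
        gate = idm.predict_gate(prev_obs, obs)   # real-time gate value G_t
        if gate > gate_threshold:
            # Interaction stage: hand control to the reactive low-level (VLA) policy.
            action = vla_policy.act(obs)
        else:
            # Instruction-following stage: keep consuming the planned action queue,
            # resuming the remaining planned actions once the gate drops again.
            action = queue.popleft()
        robot.step(action)
        prev_obs = obs
```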
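
Figure 4's joint objective combines an action-regression term and an interaction-detection term. A sketch of that loss in PyTorch follows; the function name and default weights are placeholders rather than the paper's exact implementation, and the gate prediction is assumed to already be a probability in $[0,1]$.

```python
# Sketch of the multi-head IDM training objective from Figure 4:
# L = lambda_act * Huber(a_t, a_hat_t) + lambda_gate * BCE(g_t, G_hat_t).
import torch.nn.functional as F

def multihead_idm_loss(pred_action, true_action, pred_gate, gate_label,
                       lambda_act=1.0, lambda_gate=1.0):
    # Huber loss for action regression (robust to occasional outlier actions).
    loss_act = F.huber_loss(pred_action, true_action)
    # Binary cross-entropy for interaction detection; pred_gate lies in [0, 1].
    loss_gate = F.binary_cross_entropy(pred_gate, gate_label.float())
    return lambda_act * loss_act + lambda_gate * loss_gate
```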