VideoVLA: Video Generators Can Be Generalizable Robot Manipulators

Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, Baining Guo

TL;DR

VideoVLA reframes robot manipulation by repurposing large pre-trained video generators as generalizable VLA manipulators. It jointly predicts action sequences and imagined future visuals conditioned on language and current observation, using a multi-modal Diffusion Transformer and DDPM losses. Empirical results show strong in-domain performance and robust generalization to novel objects and cross-embodiment skills in both simulation and real-world settings, with a notable correlation between imagination quality and task success. The work suggests a scalable path toward more general robot intelligence by leveraging generative video models for planning and perception-driven action.
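The "DDPM losses" mentioned above can be illustrated with a minimal NumPy sketch of the standard noise-prediction objective applied to two targets at once: future-frame latents and an action chunk. This is not the authors' implementation. The linear beta schedule, the equal weighting of the two loss terms, the tensor shapes, and the names `make_alpha_bar`, `q_sample`, `joint_ddpm_loss`, and `denoiser` are all assumptions made for illustration.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(x0, t, noise, alpha_bar):
    """DDPM forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise."""
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise


def joint_ddpm_loss(denoiser, future_latents, actions, cond, t, alpha_bar, rng):
    """Noise-prediction MSE on video latents plus MSE on the action chunk.

    `denoiser` is a hypothetical callable standing in for the multi-modal
    Diffusion Transformer: (noisy_video, noisy_actions, cond, t) -> two noise
    estimates. `cond` stands in for the language tokens and first-frame latent.
    """
    eps_v = rng.standard_normal(future_latents.shape)
    eps_a = rng.standard_normal(actions.shape)
    noisy_v = q_sample(future_latents, t, eps_v, alpha_bar)
    noisy_a = q_sample(actions, t, eps_a, alpha_bar)
    pred_v, pred_a = denoiser(noisy_v, noisy_a, cond, t)
    loss_v = np.mean((pred_v - eps_v) ** 2)
    loss_a = np.mean((pred_a - eps_a) ** 2)
    # Equal weighting of the two terms is an assumption of this sketch.
    return loss_v + loss_a, loss_v, loss_a


def zero_denoiser(noisy_v, noisy_a, cond, t):
    """Untrained placeholder that always predicts zero noise."""
    return np.zeros_like(noisy_v), np.zeros_like(noisy_a)
```

In use, one would sample a timestep `t` per training example, call `joint_ddpm_loss` with the real network in place of `zero_denoiser`, and backpropagate through the summed loss in an autodiff framework. NumPy is used here only to keep the arithmetic visible.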

Abstract

Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy - forecasting both actions and their visual consequences - explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.

Paper Structure

This paper contains 19 sections, 6 figures, 14 tables.

Figures (6)

  • Figure 1: Illustration of VideoVLA. Given a language instruction and the current visual observation, VideoVLA jointly predicts the appropriate sequence of next actions and generates video content that illustrates how these actions will influence physical interactions in the environment. In addition to delivering strong performance on in-domain tasks, VideoVLA demonstrates robust generalization to novel objects and unseen skills. This capability stems from its use of pre-trained video generation models (in contrast to prior vision-language-action approaches [team2024octo, o2023open_x_embodiment, kim2024openvla, li2024cogact, wang2024HPT, cheang2024gr2, liu2024rdt], which primarily rely on pre-trained vision-language understanding models) and from its dual-objective strategy.
  • Figure 2: Overview of VideoVLA. (a) The text encoder converts the language instruction into a fixed-length token sequence, while the video encoder transforms a video clip into a sequence of frame latents, where the first latent corresponds to the first frame (i.e., the current visual observation). (b) VideoVLA adopts a Diffusion Transformer [peebles2023DIT] architecture that conditions on the encoded language tokens and the first frame latent to jointly predict the next action chunk required to accomplish the task, along with the future frame latents that represent the anticipated visual outcomes of executing that action chunk. The video decoder, highlighted in pink, is optional and only used when visualizing the imagined future frames.
  • Figure 3: Each sub-figure illustrates the relationship between robot motion similarity—comparing visual imaginations with actual executions—and task success. Each point represents either a successful or failed execution. A higher robot motion similarity corresponds to an increased likelihood of successful execution. The plots show aggregated statistics across tasks in the SIMPLER environment using (a) the Google robot and (b) the WidowX robot.
  • Figure 4: Visualization of VideoVLA’s predicted visual imaginations and corresponding real-world executions during task completion, demonstrating a strong correlation between imagined and actual outcomes. Additional visualizations are provided in the appendix.
  • Figure 5: Visualizations of VideoVLA’s predicted visual imaginations and the corresponding executions during task completion in real-world experiments.
  • ...and 1 more figure
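The conditioning scheme described in the Figure 2 caption can be sketched as a token-layout exercise: language tokens and the first-frame latent act as clean conditioning, while future-frame latents and the action chunk are the positions being denoised. The snippet below is a hypothetical illustration only. It assumes every modality has already been projected to a shared token width, and the function names, segment ids, and token counts are invented for the example rather than taken from the paper.

```python
import numpy as np

# Segment ids used in this sketch (hypothetical naming).
TEXT, FIRST_FRAME, FUTURE_FRAMES, ACTIONS = 0, 1, 2, 3


def build_sequence(text_tokens, first_frame_latent, noisy_future_latents, noisy_actions):
    """Concatenate all modalities into one (num_tokens, dim) sequence.

    Returns the sequence, a per-token segment id, and a boolean mask marking
    the positions the transformer is asked to denoise.
    """
    parts = [text_tokens, first_frame_latent, noisy_future_latents, noisy_actions]
    seq = np.concatenate(parts, axis=0)
    segment = np.concatenate([np.full(len(p), i) for i, p in enumerate(parts)])
    # Text and the first frame are conditioning; the rest are prediction targets.
    is_target = segment >= FUTURE_FRAMES
    return seq, segment, is_target


def split_outputs(outputs, segment):
    """Recover the future-frame and action predictions from the output sequence."""
    return outputs[segment == FUTURE_FRAMES], outputs[segment == ACTIONS]
```

At inference time only the action slice needs to be read out and executed. The future-frame slice would be passed to the video decoder only when the imagined frames are to be visualized, consistent with the caption's note that the decoder is optional.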