CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion

Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Ziyuan Liu, Abhinav Valada

TL;DR

CoVAR addresses the lack of paired video–action data in robotic learning by generating video and actions jointly: a dedicated Action DiT runs in parallel with a pretrained video diffusion model, which preserves the backbone's video-domain knowledge. Bridge Attention lets the two branches exchange information across modalities, and an action refinement module improves action precision on low-resolution data. Experiments on Calvin, Libero90, and real UR5 tasks show improvements in video quality and action success rates over both two-stage and joint baselines, supporting the approach's data efficiency and practicality. The framework offers a scalable direction for using large-scale video data to train more capable robotic policies, especially when labeled data is limited.

Abstract

We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot's joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or adapt a single-modal diffusion model to the joint distribution, which cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Extensive evaluations on multiple public benchmarks and real-world datasets demonstrate that our method generates higher-quality videos and more accurate actions, significantly outperforming existing baselines and offering a scalable framework for leveraging large-scale video data for robotic learning.
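The abstract does not spell out how Bridge Attention is computed. As a rough illustration of what cross-modal exchange between a video branch and a parallel action branch could look like, the sketch below assumes a bidirectional cross-attention with residual connections. The function names (`cross_attend`, `bridge_step`), the absence of learned projections, and the requirement that both token sets share one feature dimension are all assumptions made for this example, not details taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: every query token reads from all context tokens.

    Hypothetical simplification: queries, keys, and values are the raw tokens
    (identity projections), so both token sets must share one feature dimension.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return out


def bridge_step(video_tokens, action_tokens):
    """Assumed bidirectional exchange: each branch attends to the other, then adds the result residually."""
    action_read = cross_attend(action_tokens, video_tokens)  # actions look at video
    video_read = cross_attend(video_tokens, action_tokens)   # video looks at actions
    new_video = [[x + r for x, r in zip(tok, read)]
                 for tok, read in zip(video_tokens, video_read)]
    new_action = [[x + r for x, r in zip(tok, read)]
                  for tok, read in zip(action_tokens, action_read)]
    return new_video, new_action
```

Calling `bridge_step(video_tokens, action_tokens)` returns updated token lists with the same shapes as the inputs, so in this sketch each branch keeps its own token stream and only receives an additive update from the other modality. A real model would use learned query, key, and value projections and multiple heads.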

Paper Structure

This paper contains 15 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Architectural comparison with prior methods. (a) Two-stage model [vpp, dreamgen, unipi, avdc, genieEnvisioner]. (b) Joint model [uva, uwm, pad]. (c) CoVAR. Unlike other methods, our framework extends the video DiT by attaching a dedicated DiT for action generation, while allowing the action branch to share information with the pretrained video backbone via our proposed Bridge Attention.
  • Figure 2: Overview of CoVAR. (A) It is built on a video diffusion backbone with a parallel Action DiT to generate actions. (B) The two modalities interact through Bridge Attention. (C) For low-resolution datasets, an Action Refinement Module is introduced.
  • Figure 3: Visualization of generated video-action pairs. Red lines denote ground-truth actions, shown as a reference. Blue lines denote our generated actions. The generated videos align well with the text instructions, and the paired actions closely match the ground truth and accomplish the tasks.
  • Figure 4: Comparison of generated videos with baselines. Compared with the baselines, our model renders objects and robotic arms with fewer artifacts, yielding clearer and more realistic results. The rollout shows strong alignment between the generated video and the corresponding action.
  • Figure 5: Rollout comparison between our model and the variant without action refinement. The model without action refinement produces coarse actions that merely reflect the general trend of the task but remain imprecise; action refinement enhances precision and enables successful completion.
  • ...and 2 more figures