CRAFT: Video Diffusion for Bimanual Robot Data Generation

Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme, Daniel Seita

Abstract

Bimanual robot learning from demonstrations is fundamentally limited by the cost and narrow visual diversity of real-world data, which constrains policy robustness across viewpoints, object configurations, and embodiments. We present Canny-guided Robot Data Generation using Video Diffusion Transformers (CRAFT), a video diffusion-based framework for scalable bimanual demonstration generation that synthesizes temporally coherent manipulation videos while producing action labels. By conditioning video diffusion on edge-based structural cues extracted from simulator-generated trajectories, CRAFT produces physically plausible trajectory variations and supports a unified augmentation pipeline spanning object pose changes, camera viewpoints, lighting and background variations, cross-embodiment transfer, and multi-view synthesis. We leverage a pre-trained video diffusion model to convert simulated videos, along with action labels from the simulation trajectories, into action-consistent demonstrations. Starting from only a few real-world demonstrations, CRAFT generates a large, visually diverse set of photorealistic training data, bypassing the need to replay demonstrations on the real robot (Sim2Real). Across simulated and real-world bimanual tasks, CRAFT improves success rates over existing augmentation strategies and straightforward data scaling, demonstrating that diffusion-based video generation can substantially expand demonstration diversity and improve generalization for dual-arm manipulation tasks. Our project website is available at: https://craftaug.github.io/

Figures (12)

  • Figure 1: Method Overview. (1) Trajectory Expansion: Real-world teleoperation data is first collected, and a digital twin pipeline transfers the objects and robot into simulation (Real2Sim). This simulation environment is then used for large-scale data generation. (2) Video Generation: The simulation trajectories are rendered into source videos and passed through a Canny-Edge Converter to extract structural edge representations, which are then combined with a real-world reference image and language instructions to condition a video diffusion model that synthesizes photorealistic video outputs. (3) Augmented Dataset Construction: The resulting generated videos support a wide range of visual variations, including object pose, lighting conditions, object color, background, cross-embodiment transfer, camera viewpoint, and combined wrist and third-person camera perspectives. (4) Generated Dataset: The synthesized videos are paired with action labels from the simulation trajectories, producing action-consistent demonstrations $\mathcal{D}^{\text{gen}}$ for downstream policy training.
  • Figure 2: Reference Image and Canny-edge Visualization. Examples of reference images used to condition the video diffusion model for different augmentation techniques: (1) A standard reference image of the scene capturing gripper-object contact, used to condition the video diffusion model. (2) An example Canny-edge frame extracted from the simulation source video $\mathbf{V}^{\text{c}}$, used as structural control input. (3) A lighting-modified reference image generated using Veo3 [google2025veo3] under green ambient illumination. (4) An empty-table reference image with no objects, used for object color generation. (5) A tiled reference image combining a third-person view (top left), left wrist (top right), and right wrist (bottom left), with the fourth tile left blank, supporting up to four simultaneous camera viewpoints. Reference images include top and bottom padding (not shown). (A minimal code sketch of the edge extraction and view tiling appears after this figure list.)
  • Figure 3: Simulation Environment. The Stack Bowls (SB) task (see the simulation experiments section), adapted from RoboTwin [chen2025robotwin], on the bimanual UR5 with WSG grippers, showing the initial state (left) and success state (right).
  • Figure 4: Real-world rollout of an ACT policy on the Place Cans in Plasticbox (PC) task (see the real-world results section). The bottom-right corner of each image is labeled with its position in the task progression, and images are ordered by execution stage.
  • Figure 5: Simulation Environments. Bimanual manipulation tasks adapted from RoboTwin [chen2025robotwin], shown for the bimanual UR5 with WSG grippers (left) and the bimanual Franka Panda (right). Tasks from top to bottom: Lift Pot (LP), Place Cans in Plasticbox (PC), and Stack Bowls (SB).
  • ...and 7 more figures
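
To make the conditioning pipeline described in Figures 1 and 2 concrete, below is a minimal sketch (not the authors' released code) of the two preprocessing steps: converting a simulator-rendered source video into per-frame Canny-edge control images, and tiling a third-person view with the two wrist views into a single 2x2 reference image with a blank fourth tile. The Canny thresholds, tile resolution, and function names are illustrative assumptions; the paper does not specify them.

```python
import cv2
import numpy as np

def video_to_canny_frames(src_path, low=100, high=200):
    """Convert a simulator-rendered source video into Canny-edge
    control frames (the structural conditioning input in Figure 1).
    The thresholds (low, high) are illustrative assumptions."""
    cap = cv2.VideoCapture(src_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, low, high)
        # Replicate the single edge channel to 3 channels so each
        # control frame matches the RGB format of the video model input.
        frames.append(cv2.cvtColor(edges, cv2.COLOR_GRAY2BGR))
    cap.release()
    return frames

def tile_reference_image(third_person, left_wrist, right_wrist,
                         tile_hw=(240, 320)):
    """Build the 2x2 tiled reference image from Figure 2(5):
    third-person (top left), left wrist (top right),
    right wrist (bottom left), and a blank fourth tile."""
    h, w = tile_hw
    tp, lw, rw = (cv2.resize(v, (w, h)) for v in
                  (third_person, left_wrist, right_wrist))
    blank = np.zeros((h, w, 3), dtype=np.uint8)
    return np.vstack([np.hstack([tp, lw]),
                      np.hstack([rw, blank])])
```

The resulting edge frames and reference image would then be passed, together with a language instruction, to the pre-trained video diffusion model; because the generated video follows the simulated trajectory's structure, it can be paired directly with the simulation's action labels to form the demonstrations $\mathcal{D}^{\text{gen}}$, with no real-robot replay.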