Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

Shivansh Patel, Shraddhaa Mohan, Hanlin Mai, Unnat Jain, Svetlana Lazebnik, Yunzhu Li

TL;DR

This paper introduces RIGVid, a paradigm where robots learn manipulation solely from AI-generated video demonstrations conditioned on a scene and a language command. It combines a diffusion-based video generator, GPT-4o-based video filtering, monocular depth estimation, and FoundationPose-based 6D object tracking to extract a task-relevant trajectory and retarget it onto a robot in an embodiment-agnostic manner. Empirical results show that high-quality generated videos, when filtered, can match real demonstrations in effectiveness, and RIGVid outperforms several VLM-based and trajectory-extraction baselines across four manipulation tasks, with robustness to disturbances and transferability to new embodiments. The work highlights the potential of synthetic, task-specific supervision from generative models to reduce real-data collection while enabling open-world robotic manipulation.

Abstract

This work introduces Robots Imitating Generated Videos (RIGVid), a system that enables robots to perform complex manipulation tasks--such as pouring, wiping, and mixing--purely by imitating AI-generated videos, without requiring any physical demonstrations or robot-specific training. Given a language command and an initial scene image, a video diffusion model generates potential demonstration videos, and a vision-language model (VLM) automatically filters out results that do not follow the command. A 6D pose tracker then extracts object trajectories from the video, and the trajectories are retargeted to the robot in an embodiment-agnostic fashion. Through extensive real-world evaluations, we show that filtered generated videos are as effective as real demonstrations, and that performance improves with generation quality. We also show that relying on generated videos outperforms more compact alternatives such as keypoint prediction using VLMs, and that strong 6D pose tracking outperforms other ways to extract trajectories, such as dense feature point tracking. These findings suggest that videos produced by a state-of-the-art off-the-shelf model can offer an effective source of supervision for robotic manipulation.

Paper Structure

This paper contains 26 sections, 1 equation, 21 figures, 3 tables.

Figures (21)

  • Figure 1: RIGVid overview. Given an initial scene image and depth, we generate a video conditioned on a language command. A VLM-based automatic filtering step (not shown) can be used to reject videos that fail to follow the prompt. A monocular depth estimator recovers depth for each frame of the generated video, and these depth maps are combined with the corresponding RGB frames to produce the 6D Object Pose Trajectory. After grasping, the trajectory is retargeted to the robot for execution.
  • Figure 2: Retargeting RIGVid to a robot trajectory. Assuming a fixed transformation between the end-effector and the object after grasping, the 6D Object Pose Trajectory (orange arrow) is retargeted to the robot (blue arrow). This formulation is embodiment-agnostic and can be transferred to a different robot.
  • Figure 3: RIGVid is robust to perturbations. A human pushes the robot during execution (image 1), causing the object to deviate from the planned trajectory. When the deviation is detected (image 2), the robot backtracks to the last successfully executed trajectory point (image 3) and then resumes the planned motion (image 4).
  • Figure 4: Evaluation tasks. We evaluate RIGVid on everyday manipulation tasks of varying difficulty.
  • Figure 5: Qualitative comparison of video generation for three models. Sora (top) drastically alters the scene layout and object size. Kling v1.5 (middle) does not fully follow the prompt (the water is not poured over the plant) and exhibits physically implausible behavior (water pours out of the top of the kettle rather than the spout). Kling v1.6 (bottom) produces the most consistent and realistic result.
  • ...and 16 more figures
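
The retargeting idea in the Figure 2 caption (a fixed transformation between the end-effector and the object after grasping) can be illustrated with a few lines of rigid-body algebra. The following is a minimal sketch, not the authors' implementation. It assumes that all poses are 4x4 homogeneous transforms expressed in one common frame (for example the robot base, which in practice would need a camera-to-robot calibration), and the names `make_pose`, `rot_z`, and `retarget_trajectory` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def retarget_trajectory(object_poses, ee_pose_at_grasp, object_pose_at_grasp):
    """Map an object pose trajectory to end-effector poses.

    Assumption (from the Figure 2 caption): once the object is grasped, the
    transform from the object frame to the end-effector frame stays fixed.
    All inputs are 4x4 poses in the same reference frame.
    """
    # Pose of the end-effector expressed in the object's frame at grasp time.
    object_T_ee = np.linalg.inv(object_pose_at_grasp) @ ee_pose_at_grasp
    # Carry that fixed offset along every tracked object pose.
    return [object_pose @ object_T_ee for object_pose in object_poses]


if __name__ == "__main__":
    object_at_grasp = make_pose(np.eye(3), [0.5, 0.0, 0.1])
    ee_at_grasp = make_pose(np.eye(3), [0.6, 0.0, 0.1])
    tracked = [
        object_at_grasp,
        make_pose(rot_z(np.pi / 2), [0.5, 0.2, 0.3]),
    ]
    for ee_pose in retarget_trajectory(tracked, ee_at_grasp, object_at_grasp):
        print(np.round(ee_pose[:3, 3], 3))
```

Because the computation only needs the end-effector pose at grasp time, nothing in it depends on a particular arm, which is one way to read the caption's claim that the formulation transfers to a different robot. Turning the resulting end-effector poses into joint commands (inverse kinematics, motion planning) is outside the scope of this sketch.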