Vidar: Embodied Video Diffusion Model for Generalist Manipulation

Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, Jun Zhu

TL;DR

Vidar transfers general-purpose manipulation to new robot embodiments from limited demonstrations by decoupling video-based priors from embodiment-specific actions. It uses a unified, multi-view video diffusion model pretrained on Internet-scale video and then on large robotic datasets, plus a Masked Inverse Dynamics Model (MIDM) that grounds the predicted videos in the target robot's action space. Test-time scaling further improves rollout quality by generating several candidate videos and selecting the best one with a vision-language evaluator. The approach achieves state-of-the-art results on RoboTwin 2.0 and strong real-world generalization with only about 20 minutes of demonstrations, illustrating the viability of a one-prior-many-embodiments paradigm for scalable embodied AI.
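
The test-time scaling step described above amounts to best-of-N selection. The following is a minimal sketch of that idea, not the authors' implementation: `generate` and `score` are hypothetical callables standing in for the video diffusion model and the vision-language evaluator, and the toy versions below exist only so the example runs.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in for a video: a list of frames, each a flat list of pixel values.
Video = List[List[float]]


def select_best_video(
    generate: Callable[[str, int], Video],
    score: Callable[[str, Video], float],
    instruction: str,
    num_candidates: int = 4,
    base_seed: int = 0,
) -> Tuple[Video, float, int]:
    """Sample num_candidates videos with different seeds and keep the top-scoring one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    best_video, best_score, best_seed = None, float("-inf"), -1
    for offset in range(num_candidates):
        seed = base_seed + offset
        video = generate(instruction, seed)   # one rollout from the generative prior
        value = score(instruction, video)     # evaluator's rating of that rollout
        if value > best_score:
            best_video, best_score, best_seed = video, value, seed
    return best_video, best_score, best_seed


def toy_generate(instruction: str, seed: int) -> Video:
    """Toy generator (assumption): 3 random frames of 4 pixels, fixed by the seed."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(4)] for _ in range(3)]


def toy_score(instruction: str, video: Video) -> float:
    """Toy evaluator (assumption): mean pixel value; a real evaluator would be a VLM."""
    values = [pixel for frame in video for pixel in frame]
    return sum(values) / len(values)


video, value, seed = select_best_video(toy_generate, toy_score, "stack the bowls", 5)
print(f"selected seed {seed} with score {value:.3f}")
```

The selected video would then be passed to the inverse dynamics model to obtain actions; the number of candidates trades inference cost against rollout quality.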

Abstract

Scaling general-purpose manipulation to new robot embodiments remains challenging: each platform typically needs large, homogeneous demonstrations, and end-to-end pixel-to-action pipelines may degenerate under background and viewpoint shifts. Building on previous advances in video-based robot control, we present Vidar, consisting of an embodied video diffusion model as the generalizable prior and a masked inverse dynamics model (MIDM) as the adapter. We leverage a video diffusion model pre-trained at Internet scale, and continue pre-training it for the embodied domain using 750K multi-view trajectories collected from three real-world robot platforms. For this embodied pre-training, we introduce a unified observation space that jointly encodes robot, camera, task, and scene contexts. The MIDM module learns action-relevant pixel masks without dense labels, grounding the prior in the target embodiment's action space while suppressing distractors. With only 20 minutes of human demonstrations on an unseen robot (1% of typical data), Vidar outperforms state-of-the-art baselines and generalizes to unseen tasks, backgrounds, and camera layouts. Our results suggest a scalable recipe for "one prior, many embodiments": strong, inexpensive video priors together with minimal on-robot alignment.
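
To make the idea of learning masks without dense labels concrete, here is a minimal sketch of one plausible MIDM-style objective. It rests on assumptions that this page does not confirm: that the mask is trained only through an action-regression loss plus a sparsity penalty weighted by a coefficient `lam`, and that a linear map can stand in for the action regressor. All names are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def midm_loss(
    image: Sequence[float],            # flattened pixel values
    mask_logits: Sequence[float],      # one logit per pixel, from a mask network
    weights: Sequence[Sequence[float]],  # toy linear action regressor (assumption)
    target_action: Sequence[float],    # demonstrated action
    lam: float,                        # weight of the mask sparsity penalty (assumption)
) -> float:
    """Action-regression error on the masked image plus lam times the mean mask value."""
    mask: List[float] = [sigmoid(z) for z in mask_logits]
    masked = [m * p for m, p in zip(mask, image)]
    # Predict the action from the masked image only, so gradients reward masks
    # that keep action-relevant pixels.
    predicted = [sum(w * x for w, x in zip(row, masked)) for row in weights]
    action_loss = sum((p - t) ** 2 for p, t in zip(predicted, target_action)) / len(target_action)
    # Mask values lie in (0, 1), so their mean is a normalized L1 penalty that
    # pushes pixels irrelevant to the action toward zero.
    sparsity = sum(mask) / len(mask)
    return action_loss + lam * sparsity


loss = midm_loss(
    image=[1.0, 2.0],
    mask_logits=[0.0, 0.0],
    weights=[[2.0, 0.0]],
    target_action=[1.0],
    lam=0.1,
)
print(loss)
```

Under this formulation no per-pixel mask labels are needed: the only supervision is the demonstrated action, and the penalty weight controls how aggressively background pixels are suppressed.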

Paper Structure

This paper contains 36 sections, 7 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: The overall pipeline of Vidar, where various video sources are leveraged for transferring to a new platform with limited demonstrations. A unified observation space handles heterogeneous, multi-view robotic videos and language instructions, enabling the pre-training of an embodied foundation video model on about 750,000 multi-view bimanual robotic episodes. After fine-tuning it with only 20 minutes of human demonstrations on an unseen robot platform, we adopt test-time scaling to select the best video during inference. Meanwhile, the masked inverse dynamics model (MIDM) converts videos to actions, where masks are learned to attend to action-relevant regions for background-robust action regression.
  • Figure 2: Videos of predictions (left) and corresponding executions (right) of Vidar for challenging tasks. It can handle unseen tasks and unseen backgrounds with strong semantic understanding.
  • Figure 3: Input images and corresponding masked images learned by the masked inverse dynamics model (MIDM). The two cases are from an unseen background with complex reflective surfaces, yet the predicted masked images still focus on the essential parts of the robotic arms.
  • Figure 4: Prompt for GPT-4o evaluation. Variables in the curly braces should be replaced by corresponding values. Specifically, one line of "img_seq" describes one video, and is formatted as "- **video_1**: image_1, image_2, ..., image_{n_imgs_per_video}".
  • Figure 5: Masked images learned by the masked inverse dynamics model (MIDM) with different values of $\lambda$.
  • ...and 5 more figures
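
The "img_seq" format quoted in the Figure 4 caption can be illustrated with a small helper. This is a sketch under assumptions: the function name is hypothetical, the `image_k` entries are treated as literal text placeholders for the attached frames, and image numbering is assumed to restart at 1 for each video, which the caption does not specify.

```python
def format_img_seq(n_videos: int, n_imgs_per_video: int) -> str:
    """Build one '- **video_i**: image_1, ..., image_n' line per candidate video."""
    lines = []
    for video_index in range(1, n_videos + 1):
        # Assumption: numbering restarts at image_1 for every video.
        images = ", ".join(f"image_{i}" for i in range(1, n_imgs_per_video + 1))
        lines.append(f"- **video_{video_index}**: {images}")
    return "\n".join(lines)


print(format_img_seq(n_videos=2, n_imgs_per_video=3))
```

The resulting string would replace the `{img_seq}` variable in the evaluation prompt, with one line per generated video.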