Vidar: Embodied Video Diffusion Model for Generalist Manipulation

Yao Feng; Hengkai Tan; Xinyi Mao; Chendong Xiang; Guodong Liu; Shuhe Huang; Hang Su; Jun Zhu

TL;DR

Vidar tackles the challenge of transferring general-purpose manipulation to new robot embodiments with limited demonstrations by decoupling video-based priors from embodiment-specific actions. It leverages a unified, multi-view video diffusion model pretrained on Internet-scale and large robotic datasets, plus a Masked Inverse Dynamics Model that grounds the predicted videos in the target robot's action space. Test-time scaling further improves rollout quality: several candidate videos are generated and a vision-language evaluator selects the best one. The approach achieves state-of-the-art results on RoboTwin 2.0 and strong real-world generalization with only about 20 minutes of demonstrations, illustrating the viability of a one-prior-many-embodiments paradigm for scalable embodied AI.
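The test-time scaling step described above amounts to best-of-N selection. The following is a minimal sketch of that idea, not Vidar's implementation: `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, the "video" is a toy list of floats, and the scorer is a stand-in for a vision-language evaluator.

```python
import random
from typing import Callable, List, Tuple

Video = List[float]  # toy stand-in for a generated rollout video (assumption)


def best_of_n(
    generate: Callable[[random.Random], Video],
    score: Callable[[Video], float],
    n: int,
    seed: int = 0,
) -> Tuple[Video, float, List[float]]:
    """Sample n candidates, score each one, and return the top candidate,
    its score, and the full list of scores."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    scores = [score(video) for video in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index], scores


def toy_generate(rng: random.Random) -> Video:
    # Hypothetical generator: stands in for one sampled video rollout.
    return [rng.random() for _ in range(4)]


def toy_score(video: Video) -> float:
    # Hypothetical evaluator: stands in for a vision-language model's rating.
    # Higher is better; here, values closer to 0.5 score higher.
    return -sum(abs(x - 0.5) for x in video)


best_video, best_score, all_scores = best_of_n(toy_generate, toy_score, n=8)
```

In this sketch the cost grows linearly with `n`, since every candidate must be both generated and scored before one is chosen.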

Abstract

Scaling general-purpose manipulation to new robot embodiments remains challenging: each platform typically needs large, homogeneous demonstrations, and end-to-end pixel-to-action pipelines may degrade under background and viewpoint shifts. Building on previous advances in video-based robot control, we present Vidar, which consists of an embodied video diffusion model as the generalizable prior and a masked inverse dynamics model (MIDM) as the adapter. We start from a video diffusion model pre-trained at Internet scale and continue pre-training it for the embodied domain on 750K multi-view trajectories collected from three real-world robot platforms. For this embodied pre-training, we introduce a unified observation space that jointly encodes robot, camera, task, and scene contexts. The MIDM module learns action-relevant pixel masks without dense labels, grounding the prior in the target embodiment's action space while suppressing distractors. With only 20 minutes of human demonstrations on an unseen robot (1% of typical data), Vidar outperforms state-of-the-art baselines and generalizes to unseen tasks, backgrounds, and camera layouts. Our results suggest a scalable recipe for "one prior, many embodiments": strong, inexpensive video priors together with minimal on-robot alignment.
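To make the masked inverse dynamics idea concrete, here is a toy sketch of one plausible training objective, under assumptions the abstract does not state: a soft mask in [0, 1] multiplies the pixels, a linear head regresses the action from the masked pixels, and a sparsity penalty on the mask stands in for dense mask labels. The function name, the linear head, the free mask logits (a real model would predict the mask from the frame), and the `sparsity_weight` value are all hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def masked_idm_loss(
    frame: List[float],
    mask_logits: List[float],
    weights: List[List[float]],
    target_action: List[float],
    sparsity_weight: float = 0.01,
) -> Tuple[float, List[float], List[float]]:
    """Toy forward pass: soft-mask the pixels, regress the action linearly,
    and add a penalty that encourages a sparse mask (assumed objective)."""
    mask = [sigmoid(z) for z in mask_logits]
    masked_frame = [m * p for m, p in zip(mask, frame)]
    predicted = [sum(w * x for w, x in zip(row, masked_frame)) for row in weights]
    action_loss = sum((p - t) ** 2 for p, t in zip(predicted, target_action))
    action_loss /= len(target_action)
    sparsity = sum(mask) / len(mask)  # mean mask value; lower means sparser
    return action_loss + sparsity_weight * sparsity, predicted, mask


# Two "pixels": the first is kept by the mask, the second is suppressed.
loss, predicted_action, mask = masked_idm_loss(
    frame=[1.0, 2.0],
    mask_logits=[50.0, -50.0],
    weights=[[1.0, 1.0]],
    target_action=[1.0],
)
```

In this example the suppressed second pixel contributes almost nothing to the predicted action, which is the sense in which a learned mask can screen out distractors without any per-pixel supervision.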
