VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation

Youpeng Wen, Junfan Lin, Yi Zhu, Jianhua Han, Hang Xu, Shen Zhao, Xiaodan Liang

TL;DR

VidMan outperforms the state-of-the-art baseline GR-1 on the CALVIN benchmark with an 11.7% relative improvement and delivers precision gains of over 9% on the OXE small-scale dataset, providing compelling evidence that world models can significantly enhance the precision of robot action prediction.

Abstract

Recent advances in learning video generation models from large-scale video data demonstrate significant potential for understanding complex physical dynamics. This suggests that diverse robot trajectory data could be leveraged to develop a unified, dynamics-aware model that enhances robot manipulation. However, given the relatively small amount of available robot data, directly fitting the data without considering the relationship between visual observations and actions could lead to suboptimal data utilization. To this end, we propose VidMan (Video Diffusion for Robot Manipulation), a novel framework that employs a two-stage training mechanism, inspired by dual-process theory from neuroscience, to enhance stability and improve data utilization efficiency. In the first stage, VidMan is pre-trained on the Open X-Embodiment dataset (OXE) to predict future visual trajectories in a video denoising diffusion manner, enabling the model to develop a long-horizon awareness of the environment's dynamics. In the second stage, a flexible yet effective layer-wise self-attention adapter is introduced to transform VidMan into an efficient inverse dynamics model that predicts actions modulated by the implicit dynamics knowledge via parameter sharing. VidMan outperforms the state-of-the-art baseline GR-1 on the CALVIN benchmark, achieving an 11.7% relative improvement, and demonstrates precision gains of over 9% on the OXE small-scale dataset. These results provide compelling evidence that world models can significantly enhance the precision of robot action prediction. Code and models will be made public.

Paper Structure

This paper contains 19 sections, 6 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: VidMan's two-stage training paradigm mirrors dual-process theory: its first stage (like System 2) pre-trains on understanding environment dynamics through video diffusion, forming a foundation for accurate action prediction, while its second stage (like System 1) is adapted from the first stage to leverage the learned dynamics knowledge for rapid, low-level action inference.
  • Figure 2: Overview of VidMan. (a) A video tokenizer converts the uniformly sampled robot visual trajectory $O_s$ into video tokens $V_s$. (b) In the first stage, the video tokens processed through the diffusion process are concatenated with the historical tokens along the channel dimension to form $V_c^k$. $V_c^k$, along with the language tokens and the diffusion step $k$, is fed into Open-Sora for video prediction training. In the second stage, a learnable action token is passed through a layer-wise adapter applied to the output of each Open-Sora block to obtain tokens $V_\text{action}$ that integrate future frame information. $V_\text{action}$ is then fed into the Diffusion Action Head $\pi_{\phi_{dec}}$ for action prediction training.
  • Figure 3: Offline performance. The average of xyz accuracy and angle accuracy (Avg xyz ang) and the MSE correspond to the left and right y-axes of the graph, respectively. All models were trained on OXE and evaluated offline on four datasets. VidMan outperformed Octo-base octo_2023 by 5.6% on Bridge, 2.6% on Taco Play, 9.9% on Cable Routing, and 9.0% on Autolab UR5. Our method also shows improvements over the VidMan-GPT approach.
  • Figure 4: Efficiency comparison between two types of training.
  • Figure 5: Our model uses a layer-wise adapter consisting of a self-attention layer and a feed-forward network (FFN). The adapter uses a gating mechanism to distill the information extracted by the Open-Sora block into the action query.
  • ...and 2 more figures
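
The caption of Figure 5 describes the layer-wise adapter only at a high level, so the following is a minimal sketch of how such a gated block might look, not the authors' implementation. The single-head attention, the tanh gates initialized to zero, the ReLU FFN, and every name used here (GatedAdapter, gate_attn, gate_ffn, the toy dimensions) are assumptions made for illustration. The sketch lets an action token attend to itself and to the tokens produced by one backbone block, then adds the result back through a gate.

```python
import math
import random


def matvec(weight, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class GatedAdapter:
    """Hypothetical gated adapter: self-attention + FFN on one action token."""

    def __init__(self, dim, hidden, rng):
        self.w_q = random_matrix(dim, dim, rng)
        self.w_k = random_matrix(dim, dim, rng)
        self.w_v = random_matrix(dim, dim, rng)
        self.w_1 = random_matrix(hidden, dim, rng)
        self.w_2 = random_matrix(dim, hidden, rng)
        # Assumption: gates start at zero, so the adapter is initially an identity.
        self.gate_attn = 0.0
        self.gate_ffn = 0.0

    def __call__(self, action_token, block_tokens):
        # Self-attention over [action token; block tokens]; only the
        # action-token output row is needed, so only its query is computed.
        sequence = [action_token] + block_tokens
        query = matvec(self.w_q, action_token)
        keys = [matvec(self.w_k, tok) for tok in sequence]
        values = [matvec(self.w_v, tok) for tok in sequence]
        scale = math.sqrt(len(query))
        weights = softmax(
            [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        )
        attended = [
            sum(w * val[i] for w, val in zip(weights, values))
            for i in range(len(query))
        ]
        # Gated residual update from attention.
        g_attn = math.tanh(self.gate_attn)
        x = [a + g_attn * o for a, o in zip(action_token, attended)]
        # Gated residual update from the feed-forward network.
        hidden = [max(0.0, h) for h in matvec(self.w_1, x)]
        ffn_out = matvec(self.w_2, hidden)
        g_ffn = math.tanh(self.gate_ffn)
        return [xi + g_ffn * fi for xi, fi in zip(x, ffn_out)]


rng = random.Random(0)
dim, num_layers, num_tokens = 8, 3, 5
adapters = [GatedAdapter(dim, 16, rng) for _ in range(num_layers)]
action_token = [rng.gauss(0.0, 1.0) for _ in range(dim)]
# Stand-ins for the per-block outputs of a frozen or shared video backbone.
layer_outputs = [
    [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
    for _ in range(num_layers)
]

# With closed gates the action token passes through every layer unchanged.
closed = action_token
for adapter, tokens in zip(adapters, layer_outputs):
    closed = adapter(closed, tokens)

# Opening the gates lets each layer write backbone information into the token.
for adapter in adapters:
    adapter.gate_attn = adapter.gate_ffn = 1.0
opened = action_token
for adapter, tokens in zip(adapters, layer_outputs):
    opened = adapter(opened, tokens)
```

In this sketch one adapter is attached per backbone block and the action token is carried from layer to layer, which is one plausible reading of "layer-wise". A zero-initialized gate is a common way to add an adapter without disturbing a pre-trained model at the start of fine-tuning; whether VidMan initializes its gate this way is not stated in the text above.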