
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

Peiyan Li, Yixiang Chen, Yuan Xu, Jiabing Yang, Xiangnan Wu, Jun Guo, Nan Sun, Long Qian, Xinghang Li, Xin Xiao, Jing Liu, Nianfeng Liu, Tao Kong, Yan Huang, Liang Wang, Tieniu Tan

Abstract

Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image-text pairs, resulting in high data requirements and a limited understanding of environment dynamics. To address this, we introduce MV-VDP, a multi-view video diffusion policy that jointly models the 3D spatio-temporal state of the environment. The core idea is to simultaneously predict multi-view heatmap videos and RGB videos, which 1) aligns the representation format of action finetuning with that of video pretraining, and 2) specifies not only what actions the robot should take, but also how the environment is expected to evolve in response to those actions. Extensive experiments show that MV-VDP enables data-efficient, robust, generalizable, and interpretable manipulation. With only ten demonstration trajectories and without additional pretraining, MV-VDP performs complex real-world tasks, remains robust across a range of model hyperparameters, generalizes to out-of-distribution settings, and predicts realistic future videos. Experiments on Meta-World and real-world robotic platforms demonstrate that MV-VDP consistently outperforms video-prediction-based, 3D-based, and vision-language-action models, establishing a new state of the art in data-efficient multi-task manipulation.
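To make the heatmap action representation described above concrete, the following is a minimal sketch of how an end-effector position could be rendered as one 2D Gaussian heatmap per camera view. It is an illustration, not the paper's implementation: the pinhole projection model, camera parameters, image size, and Gaussian width `sigma` are all assumptions for the example.

```python
# Minimal sketch (illustrative, not the paper's code): render per-view
# Gaussian heatmaps from a 3D end-effector position. Camera parameters,
# image size, and `sigma` are assumed values for the example.
import numpy as np

def project(point_3d, K, T_world_to_cam):
    """Project a 3D world point to pixel coordinates with a pinhole model."""
    p_cam = T_world_to_cam[:3, :3] @ point_3d + T_world_to_cam[:3, 3]
    uv = K @ p_cam
    return uv[:2] / uv[2]  # perspective divide

def gaussian_heatmap(uv, height, width, sigma=4.0):
    """Render a 2D Gaussian centered at pixel location `uv`."""
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - uv[0]) ** 2 + (ys - uv[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def multiview_heatmaps(ee_pos, cameras, height=256, width=256):
    """One heatmap per camera view; stacking these over future timesteps
    yields the multi-view heatmap "video" that the policy predicts."""
    return np.stack([
        gaussian_heatmap(project(ee_pos, K, T), height, width)
        for K, T in cameras
    ])
```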

Figures (11)

  • Figure 1: Overview. We introduce MV-VDP, a multi-view video diffusion policy that jointly models the spatio-temporal state of the environment. Compared to prior manipulation policies, our approach: (1) processes 3D-aware multi-view images rather than independent multiple 2D views; (2) represents robot states and actions as multi-view heatmaps, aligning the action space with the representation used in video pretraining; (3) leverages a video foundation model, instead of a traditional vision-language backbone, to jointly model future RGB sequences and heatmap sequences. As a result, MV-VDP achieves state-of-the-art performance on both the Meta-World and real-world benchmarks, outperforming video-prediction-based, 3D-based, and vision-language-action models.
  • Figure 2: Overview of MV-VDP's pipeline. (a) Point clouds and the current end-effector pose are projected into spatially aware multi-view RGB images and heatmaps, which are encoded and used to jointly predict future multi-view RGB videos and heatmap videos via a video diffusion model. Predicted heatmaps are back-projected to recover 3D end-effector positions (both the back-projection and the view-attention steps are sketched in code after this figure list). (b) The multi-view video diffusion transformer augments a pretrained video diffusion backbone with view-attention to enable cross-view interaction. (c) A lightweight action decoder predicts end-effector rotation and gripper states from the denoised video latents. Final action chunks are formed by combining the predicted positions, rotations, and gripper states.
  • Figure 3: Real-world experimental setup and tasks. We evaluate MV-VDP on three manipulation tasks using a Franka Research 3 robot with three ZED2i cameras. We further assess generalization under variations in background, object height, lighting, and object category.
  • Figure 4: Average success rates for different inference denoising steps. The experiments are conducted on the Meta-World benchmark. MV-VDP demonstrates high robustness to varying diffusion steps, achieving strong performance even when the denoising step is set to 1.
  • Figure 5: Visualization of the predicted RGB sequences and heatmap sequences for the Button-Press-Top task in Meta-World. For each view, the first and third rows show predictions from MV-VDP, while the second and fourth rows show the corresponding ground truth. The peak locations of both predicted and ground-truth heatmaps are overlaid on the predicted and ground-truth RGB images, respectively. The results show that (1) the predicted RGB sequences are visually realistic, and (2) the heatmap peaks closely follow the end effector in the RGB images, indicating strong consistency between the predicted heatmaps and RGB sequences.
  • ...and 6 more figures
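The back-projection step in Figure 2(a) can be approximated by taking the peak of each predicted heatmap and triangulating across the calibrated views. Below is a minimal linear-triangulation (DLT) sketch under that assumption; the argmax peak extraction and known 3x4 projection matrices are illustrative choices, not the paper's exact recovery procedure.

```python
# Minimal sketch: recover a 3D end-effector position from predicted
# multi-view heatmaps via peak extraction + linear (DLT) triangulation.
# Projection matrices P_i = K_i [R_i | t_i] are assumed to be known.
import numpy as np

def heatmap_peak(heatmap):
    """Return the (u, v) pixel location of the heatmap's maximum."""
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return float(u), float(v)

def triangulate(peaks, projections):
    """Least-squares triangulation from two or more views.

    peaks:       list of (u, v) pixel coordinates, one per view
    projections: list of 3x4 camera projection matrices
    """
    rows = []
    for (u, v), P in zip(peaks, projections):
        # Each view contributes two linear constraints on the 3D point X:
        # u * (P[2] @ X) = P[0] @ X  and  v * (P[2] @ X) = P[1] @ X
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.asarray(rows)
    # Homogeneous least-squares solution: singular vector of smallest
    # singular value of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize
```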
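The view-attention mentioned in Figure 2(b) can be read as attention that mixes tokens across camera views at the same spatio-temporal location. The sketch below shows one such layer under assumed conventions; the tensor layout (batch B, views V, tokens N per view, channels D) and the pre-norm residual design are assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a view-attention layer: each spatio-temporal token
# attends only to its counterparts in the other camera views.
# The (B, V, N, D) layout is an assumption for illustration.
import torch
import torch.nn as nn

class ViewAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, V, N, D) -> fold the N spatio-temporal positions into the
        # batch so attention runs over the V views at each position.
        B, V, N, D = x.shape
        tokens = x.permute(0, 2, 1, 3).reshape(B * N, V, D)
        h = self.norm(tokens)
        out, _ = self.attn(h, h, h, need_weights=False)
        out = out.reshape(B, N, V, D).permute(0, 2, 1, 3)
        return x + out  # residual: keeps the pretrained backbone's features
```

Initializing such a layer so the residual branch starts near zero is a common way to add new attention paths to a pretrained diffusion backbone without disturbing its behavior at the start of finetuning.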