ManipDreamer3D: Synthesizing Plausible Robotic Manipulation Video with Occupancy-aware 3D Trajectory
Ying Li, Xiaobao Wei, Xiaowei Chi, Yuming Li, Zhongyu Zhao, Hao Wang, Ningning Ma, Ming Lu, Sirui Han, Shanghang Zhang
TL;DR
Robotic manipulation video generation is held back by limited demonstrations and by the 3D spatial ambiguity of 2D trajectory conditioning. This work introduces ManipDreamer3D, which reconstructs a 3D occupancy map from a single input image, plans a short, collision-free 3D trajectory for the end-effector, and synthesizes the video by conditioning a diffusion model on a 3D-to-2D trajectory representation. Trajectory optimization combines the multi-objective losses $\mathcal{L}_{col}$, $\mathcal{L}_{len}$, $\mathcal{L}_{acc}$, and $\mathcal{L}_{cur}$ to obtain the optimized path $P^3_{opt}$; path-aware time reallocation and 3D-to-2D latent editing then turn that path into the conditioning signal for video generation. Experiments on bridge datasets show state-of-the-art video quality (FVD, PSNR, SSIM) and precise trajectory adherence, enabling fine-grained control at the keypoint, full-trajectory, and affordance levels with reduced manual intervention.
Abstract
Data scarcity continues to be a major challenge in the field of robotic manipulation. Although diffusion models provide a promising solution for generating robotic manipulation videos, existing methods largely depend on 2D trajectories, which inherently face issues with 3D spatial ambiguity. In this work, we present a novel framework named ManipDreamer3D for generating plausible 3D-aware robotic manipulation videos from the input image and the text instruction. Our method combines 3D trajectory planning with a reconstructed 3D occupancy map created from a third-person perspective, along with a novel trajectory-to-video diffusion model. Specifically, ManipDreamer3D first reconstructs the 3D occupancy representation from the input image and then computes an optimized 3D end-effector trajectory, minimizing path length while avoiding collisions. Next, we employ a latent editing technique to create video sequences from the initial image latent and the optimized 3D trajectory. This process conditions our specially trained trajectory-to-video diffusion model to produce robotic pick-and-place videos. Our method generates robotic videos with autonomously planned plausible 3D trajectories, significantly reducing human intervention requirements. Experimental results demonstrate superior visual quality compared to existing methods.
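To make the planning objective concrete, the following is a minimal sketch of a weighted trajectory cost that trades path length against collision avoidance and smoothness. It is an illustration, not the formulation used in the paper: the functional form of each term, the weights, the safety margin, and all function names are assumptions, and the occupancy map is reduced to a hypothetical list of occupied voxel centers.

```python
import math

# Illustrative sketch only. Term definitions, weights, and names are
# assumptions and are not taken from the paper.

def sub(a, b):
    """Component-wise difference of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))

def norm(v):
    """Euclidean norm of a 3D vector."""
    return math.sqrt(sum(x * x for x in v))

def length_loss(path):
    """Total polyline length of the waypoint sequence."""
    return sum(norm(sub(q, p)) for p, q in zip(path, path[1:]))

def acceleration_loss(path):
    """Sum of squared second differences (a discrete acceleration proxy)."""
    total = 0.0
    for p0, p1, p2 in zip(path, path[1:], path[2:]):
        acc = tuple(a - 2 * b + c for a, b, c in zip(p0, p1, p2))
        total += sum(x * x for x in acc)
    return total

def curvature_loss(path):
    """Sum of (1 - cos(turning angle)) over consecutive segments."""
    total = 0.0
    for p0, p1, p2 in zip(path, path[1:], path[2:]):
        u, v = sub(p1, p0), sub(p2, p1)
        nu, nv = norm(u), norm(v)
        if nu == 0.0 or nv == 0.0:
            continue  # skip degenerate (zero-length) segments
        cos_angle = sum(a * b for a, b in zip(u, v)) / (nu * nv)
        total += 1.0 - cos_angle
    return total

def collision_loss(path, occupied_centers, margin):
    """Squared hinge penalty for waypoints closer than `margin` to an occupied voxel center."""
    total = 0.0
    for p in path:
        for c in occupied_centers:
            total += max(0.0, margin - norm(sub(p, c))) ** 2
    return total

def total_loss(path, occupied_centers, margin=0.2, weights=(100.0, 1.0, 0.5, 0.5)):
    """Weighted sum of the four terms; the weights are arbitrary placeholders."""
    w_col, w_len, w_acc, w_cur = weights
    return (w_col * collision_loss(path, occupied_centers, margin)
            + w_len * length_loss(path)
            + w_acc * acceleration_loss(path)
            + w_cur * curvature_loss(path))

# Usage: one occupied voxel sits on the straight line between start and goal.
obstacles = [(0.5, 0.0, 0.0)]
straight = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
detour = [(0.0, 0.0, 0.0), (0.5, 0.3, 0.0), (1.0, 0.0, 0.0)]
print(total_loss(straight, obstacles), total_loss(detour, obstacles))
```

With these placeholder weights the detour scores lower than the straight path: it is longer and less smooth, but the collision term dominates. A planner would minimize such a cost over the waypoints, and the relative weights decide how far the path bends away from occupied space.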
