Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

Xiao Fu; Xintao Wang; Xian Liu; Jianhong Bai; Runsen Xu; Pengfei Wan; Di Zhang; Dahua Lin

TL;DR

RoboMaster generates realistic robotic manipulation videos by introducing a collaborative trajectory mechanism that unifies arm and object motion across three interaction phases: pre-interaction, interaction, and post-interaction. It couples appearance- and shape-aware object embeddings with a diffusion-transformer backbone and a motion injector to mitigate feature entanglement during interaction. The approach achieves state-of-the-art trajectory accuracy and visual quality on Bridge V2 and generalizes robustly in the wild, as supported by extensive ablations. The framework enables interactive, high-fidelity synthetic data generation for robotic policy learning and sim-to-real research.

Abstract

Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primarily focus on individual object motion and struggle to capture the multi-object interactions that are crucial in complex robotic manipulation. This limitation arises from multi-feature entanglement in overlapping regions, which degrades visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose the scene into objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Each stage is modeled using the feature of the dominant object, specifically the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction, thereby avoiding the multi-object feature fusion during interaction that limits prior work. To further ensure subject semantic consistency throughout the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge V2 dataset, as well as in-the-wild evaluation, demonstrate that our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
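
The three-stage decomposition described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `TrajectoryStep`, `stage_of`, and `collaborative_trajectory`, the use of 2D points, and the assumption that the interaction window is given as a frame range `[start, end)` are all hypothetical choices made for illustration. The sketch only shows the selection rule stated in the abstract: the robotic arm is the dominant subject before and after interaction, and the manipulated object is the dominant subject during interaction.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TrajectoryStep:
    frame: int
    stage: str    # "pre-interaction", "interaction", or "post-interaction"
    subject: str  # dominant subject for this frame (hypothetical labels)
    point: Point  # 2D location taken from the dominant subject's path


def stage_of(frame: int, start: int, end: int) -> str:
    """Label a frame, assuming interaction spans frames [start, end)."""
    if frame < start:
        return "pre-interaction"
    if frame < end:
        return "interaction"
    return "post-interaction"


def collaborative_trajectory(
    arm_path: List[Point],
    object_path: List[Point],
    start: int,
    end: int,
) -> List[TrajectoryStep]:
    """Merge two per-frame paths into one trajectory with a single
    dominant subject per frame (assumed selection rule, see lead-in)."""
    if len(arm_path) != len(object_path):
        raise ValueError("arm_path and object_path must have equal length")
    if not 0 <= start <= end <= len(arm_path):
        raise ValueError("interaction window must lie within the video")

    steps = []
    for frame, (arm_pt, obj_pt) in enumerate(zip(arm_path, object_path)):
        stage = stage_of(frame, start, end)
        if stage == "interaction":
            # During interaction the manipulated object is dominant.
            subject, point = "object", obj_pt
        else:
            # Before and after interaction the robotic arm is dominant.
            subject, point = "robot_arm", arm_pt
        steps.append(TrajectoryStep(frame, stage, subject, point))
    return steps


if __name__ == "__main__":
    arm = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
    obj = [(2.0, 2.0), (2.0, 2.0), (2.0, 2.0), (3.0, 3.0), (3.0, 3.0)]
    for step in collaborative_trajectory(arm, obj, start=2, end=4):
        print(step)
```

Because each frame carries exactly one subject label, a downstream conditioning module in this sketch would never need to fuse arm and object features for the same frame, which is the entanglement the abstract attributes to prior methods.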
