Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, Dahua Lin

TL;DR

RoboMaster addresses the challenge of generating realistic robotic manipulation videos by introducing a collaborative trajectory mechanism that unifies arm and object motion across three interaction phases. It couples appearance- and shape-aware object embeddings with a diffusion-transformer backbone and a motion injector to mitigate feature entanglement during interaction. The approach achieves state-of-the-art trajectory accuracy and visual quality on Bridge V2 and demonstrates robust generalization in-the-wild, supported by extensive ablations. This framework enables interactive, high-fidelity synthetic data generation for robotic policy learning and sim-to-real research.

Abstract

Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primarily focus on individual object motion and struggle to capture the multi-object interactions that are crucial in complex robotic manipulation. This limitation arises from multi-feature entanglement in overlapping regions, which leads to degraded visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Each stage is modeled using the feature of the dominant object, specifically the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction, thereby mitigating the drawback of multi-object feature fusion present during interaction in prior work. To further ensure subject semantic consistency throughout the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge V2 dataset, as well as in-the-wild evaluation, demonstrate that our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
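
The three-stage decomposition can be illustrated with a minimal sketch: each trajectory point is labelled with its phase and with the dominant object whose feature would represent it. This is not the authors' implementation. The function name `assign_phase_features`, the boundary indices `t_start` and `t_end`, and the string labels are hypothetical, and the phase boundaries are assumed to be given, since the abstract does not say how they are determined.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) position of the trajectory at one frame


def assign_phase_features(
    trajectory: List[Point],
    t_start: int,
    t_end: int,
) -> List[Tuple[Point, str, str]]:
    """Label every trajectory point with (point, phase, dominant object).

    Assumed convention (hypothetical, for illustration only):
      frames [0, t_start)        -> pre-interaction,  dominated by the arm
      frames [t_start, t_end)    -> interaction,      dominated by the object
      frames [t_end, len(traj))  -> post-interaction, dominated by the arm
    """
    if not 0 <= t_start <= t_end <= len(trajectory):
        raise ValueError("expected 0 <= t_start <= t_end <= len(trajectory)")

    labelled = []
    for t, point in enumerate(trajectory):
        if t < t_start:
            phase, dominant = "pre-interaction", "arm"
        elif t < t_end:
            phase, dominant = "interaction", "object"
        else:
            phase, dominant = "post-interaction", "arm"
        labelled.append((point, phase, dominant))
    return labelled
```

Because only one object's feature is attached to any frame, no frame ever has to fuse the arm and object features, which is the entanglement the abstract attributes to separate per-object trajectories.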

Paper Structure

This paper contains 23 sections, 7 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: RoboMaster synthesizes realistic robotic manipulation videos given an initial frame, a prompt, a user-defined object mask, and a collaborative trajectory describing the motion of both the robotic arm and the manipulated object in decomposed interaction phases. It supports diverse manipulation skills and can generalize to in-the-wild scenarios. More results are available on the project page: https://fuxiao0719.github.io/projects/robomaster/.
  • Figure 2: Collaborative Trajectory (Ours) vs. Separated Trajectories (Previous, e.g. Tora). Unlike Tora [zhang2024tora], which decomposes objects and uses separate trajectories to model the motion of the robot arm and the manipulated object, we decompose the interaction phase and unify their joint motions into a single collaborative trajectory with fine-grained object awareness. This integration alleviates the feature fusion issue in overlapping regions (see the missing apple in Tora) and improves visual quality.
  • Figure 3: RoboMaster Framework. Given an input image $\mathbf{I}$ and a prompt $\mathbf{c}$, it generates a desired robotic manipulation video $\mathbf{X}$ with the collaborative trajectory design. Specifically, it first encodes the object masks, namely the robotic arm mask $\mathbf{M}_d$ and the submissive object mask $\mathbf{M}_s$ (obtained from either Grounded-SAM [ren2024grounded] or a user-defined brush mask), with awareness of appearance and shape to obtain $\mathbf{v}_d, \mathbf{v}_s$, which maintain identity consistency in the video. To precisely model the manipulation process, the controlled trajectory $\mathcal{C}$ is decomposed into sub-interaction phases: pre-interaction $\mathcal{C}_1$, interaction $\mathcal{C}_2$, and post-interaction $\mathcal{C}_3$, associating each phase with object-specific latents $\mathbf{v}_d$, $\mathbf{v}_s$, and $\mathbf{v}_d$, respectively. The collaborative trajectory latent $\mathbf{V}$ is then injected into plug-and-play motion injectors, enabling the reasoning of video dynamics during generation.
  • Figure 4: Subject Embedding Illustration. The object mask $\mathbf{M}$ is interpolated to align with the encoded RGB latents $\mathbf{z}$. The latents $\mathbf{z}$ are then sampled at the valid mask pixels and average-pooled to generate the embedding $\tilde{\mathbf{v}}$. To enhance spatial awareness, the object token is expanded by a radius $r$, which is proportional to the area of the valid mask region, yielding the circular volume $\mathbf{v}$.
  • Figure 5: Qualitative Comparison. RoboMaster (ours) demonstrates superior performance across a range of manipulation skills (e.g., move, pick, close, upright), exhibiting improved visual consistency of the manipulated subject compared to prior baselines.
  • ...and 7 more figures
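
The subject embedding described in the Figure 4 caption (masked average pooling followed by expansion into a circular volume) can be sketched in a few lines of plain Python. This is an illustrative reading of the caption, not the authors' code. The function names, the nested-list latent layout `[channel][row][column]`, the `center` argument, and the proportionality constant `radius_per_pixel` are all assumptions. The mask is also assumed to be already interpolated to the latent resolution.

```python
from typing import List, Tuple

Latent = List[List[List[float]]]  # [channel][row][column]
Mask = List[List[int]]            # [row][column], 1 marks a valid object pixel


def masked_average(z: Latent, mask: Mask) -> List[float]:
    """Average each latent channel over the valid mask pixels."""
    valid = [(y, x) for y, row in enumerate(mask) for x, m in enumerate(row) if m]
    if not valid:
        raise ValueError("mask has no valid pixels")
    return [sum(channel[y][x] for y, x in valid) / len(valid) for channel in z]


def circular_volume(
    embedding: List[float],
    mask: Mask,
    center: Tuple[int, int],
    radius_per_pixel: float = 0.25,  # hypothetical proportionality constant
) -> Latent:
    """Copy the embedding into a disc around `center`; zeros elsewhere.

    Following the caption, the radius is proportional to the mask area.
    """
    height, width = len(mask), len(mask[0])
    area = sum(sum(row) for row in mask)
    radius = radius_per_pixel * area
    cx, cy = center

    volume = [[[0.0] * width for _ in range(height)] for _ in embedding]
    for y in range(height):
        for x in range(width):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                for c, value in enumerate(embedding):
                    volume[c][y][x] = value
    return volume
```

In this reading, `center` would be one point of the trajectory, so repeating `circular_volume` per frame would give a per-frame spatial volume for the object. How the resulting volumes are stacked over time and passed to the motion injector is not specified in the captions and is left out here.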