V-Dreamer: Automating Robotic Simulation and Trajectory Synthesis via Video Generation Priors

Songjia He, Zixuan Chen, Hongyu Ding, Dian Shao, Jieqi Shi, Chenxu Li, Jing Huo, Yang Gao

Abstract

Training generalist robots demands large-scale, diverse manipulation data, yet real-world collection is prohibitively expensive, and existing simulators are often constrained by fixed asset libraries and manual heuristics. To bridge this gap, we present V-Dreamer, a fully automated framework that generates open-vocabulary, simulation-ready manipulation environments and executable expert trajectories directly from natural language instructions. V-Dreamer employs a novel generative pipeline that constructs physically grounded 3D scenes using large language models and 3D generative models, validated by geometric constraints to ensure stable, collision-free layouts. Crucially, for behavior synthesis, we leverage video generation models as rich motion priors. These visual predictions are then mapped into executable robot trajectories via a robust Sim-to-Gen visual-kinematic alignment module utilizing CoTracker3 and VGGT. This pipeline supports high visual diversity and physical fidelity without manual intervention. To evaluate the generated data, we train imitation learning policies on synthesized trajectories encompassing diverse object and environment variations. Extensive evaluations on tabletop manipulation tasks using the Piper robotic arm demonstrate that our policies robustly generalize to unseen objects in simulation and achieve effective sim-to-real transfer, successfully manipulating novel real-world objects.

Paper Structure (20 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: We propose V-Dreamer, a fully automated full-cycle pipeline for instruction-driven, open-vocabulary robotic manipulation data synthesis. Given a natural language instruction (optionally with a real-scene photo), V-Dreamer generates a simulation-ready scene and an executable expert trajectory, which are then used to train a policy that can zero-shot generalize to unseen objects in simulation and transfer zero-shot to real robotic hardware.
  • Figure 2: Overview of the V-Dreamer Pipeline. The framework consists of three integrated stages: (1) Semantic-to-Physics Scene Synthesis. An LLM-based semantic planner decomposes prompts into asset manifests, which are then transformed into 3D meshes via generative visual synthesis and memory-efficient reconstruction before being assembled into a physics-validated layout. (2) Video-Prior-Based Trajectory Generation. Using the stabilized scene as a prior, we employ video foundation models with targeted negative prompting to dream up physically plausible manipulation motions. (3) Sim-to-Gen Alignment. Actionable 3D trajectories are extracted from the 2D pixel domain through mask-restricted tracking, metric depth estimation, and 3D motion lifting, ultimately mapping visual dreams into executable robot end-effector poses.
  • Figure 3: Qualitative examples of V-Dreamer synthesized tasks of increasing difficulty. Left: Input Prompts and generated 3D scenes. Middle: Generated manipulation videos and the corresponding robot manipulation results. Right: Extracted end-effector trajectories for demonstrations.
  • Figure 4: Qualitative examples of V-Dreamer synthesized scenes with varying object instances, textures, and layouts.
  • Figure 5: Simulation Evaluation. (Left top) Representative synthesized training scenes with object augmentation, illustrating diverse spatial layouts used to generate demonstrations. (Left bottom) Zero-shot inference trajectory of the learned policy on an unseen mug, showing the manipulation sequence (Init $\to$ Pick $\to$ Grasp $\to$ Place). (Right) Success rate of the learned policy on 10 unseen mug instances under different training dataset sizes (500 trials each). Performance improves with increasing data scale and peaks at 2.5k synthesized demonstrations.
  • ...and 1 more figure
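
The "3D motion lifting" step named in the Figure 2 caption can be illustrated with a standard pinhole back-projection: a tracked 2D pixel plus its metric depth yields a 3D point in the camera frame. The following is a minimal sketch under stated assumptions, not the authors' implementation. It does not call CoTracker3 or VGGT; the per-frame pixel tracks and depths are assumed to be already available, and the function names (`backproject`, `lift_track`) and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical placeholders. Mapping the camera-frame points to robot end-effector poses would need further steps (for example, camera extrinsics) that are not shown here.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Lift pixel (u, v) with metric depth to camera-frame XYZ.

    Assumes an ideal pinhole camera (no lens distortion); the
    intrinsics are hypothetical inputs, not values from the paper.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_track(track_uv: Sequence[Tuple[float, float]],
               depths: Sequence[float],
               fx: float, fy: float, cx: float, cy: float) -> List[Point3D]:
    """Lift one 2D point track (one pixel per frame) to a 3D trajectory."""
    if len(track_uv) != len(depths):
        raise ValueError("need exactly one depth value per tracked pixel")
    return [backproject(u, v, d, fx, fy, cx, cy)
            for (u, v), d in zip(track_uv, depths)]


# Illustrative inputs only: a point moving right in the image while
# approaching the camera, with made-up intrinsics.
trajectory = lift_track(
    track_uv=[(320.0, 240.0), (420.0, 240.0)],
    depths=[2.0, 1.0],
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
)
```

In this sketch a pixel at the principal point always lifts to a point on the optical axis, so any lateral motion in the resulting trajectory comes from the pixel offset scaled by depth over focal length.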