DemoGen: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning

Zhengrong Xue, Shuying Deng, Zhenyang Chen, Yixuan Wang, Zhecheng Yuan, Huazhe Xu

TL;DR

DemoGen tackles the data inefficiency of visuomotor policy learning by generating fully synthetic, spatially augmented demonstrations from a single human-collected demonstration per task. It combines action adaptation based on Task and Motion Planning (TAMP) with 3D point-cloud observation synthesis to produce usable, varied demonstrations at negligible compute cost. Real-world and simulated experiments show significant improvements in spatial generalization across diverse tasks and platforms, and extensions add disturbance resistance and obstacle avoidance. The approach reduces the data-collection burden in robotic manipulation while preserving closed-loop control, though visual mismatch and the reliance on point-cloud segmentation remain limitations to address.

Abstract

Visuomotor policies have shown great promise in robotic manipulation but often require substantial amounts of human-collected data for effective performance. A key reason underlying the data demands is their limited spatial generalization capability, which necessitates extensive data collection across different object configurations. In this work, we present DemoGen, a low-cost, fully synthetic approach for automatic demonstration generation. Using only one human-collected demonstration per task, DemoGen generates spatially augmented demonstrations by adapting the demonstrated action trajectory to novel object configurations. Visual observations are synthesized by leveraging 3D point clouds as the modality and rearranging the subjects in the scene via 3D editing. Empirically, DemoGen significantly enhances policy performance across a diverse range of real-world manipulation tasks, showing its applicability even in challenging scenarios involving deformable objects, dexterous hand end-effectors, and bimanual platforms. Furthermore, DemoGen can be extended to enable additional out-of-distribution capabilities, including disturbance resistance and obstacle avoidance.
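
The core idea in the abstract, moving an object in the point cloud and adapting the demonstrated actions to match, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a pure translation, a precomputed per-point object mask, and actions stored as absolute end-effector positions. The function name `augment_translation` and the data layout are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def translate(p: Point, offset: Point) -> Point:
    """Shift a single 3D point by a fixed offset."""
    return (p[0] + offset[0], p[1] + offset[1], p[2] + offset[2])


def augment_translation(
    points: Sequence[Point],
    object_mask: Sequence[bool],
    skill_actions: Sequence[Point],
    offset: Point,
) -> Tuple[List[Point], List[Point]]:
    """Create one synthetic sample by moving an object and its actions.

    Assumption (not from the paper): the object moves by a pure translation,
    and the actions that interact with it move by the same translation.
    """
    # Edit the observation: only points flagged as the object are moved.
    new_points = [
        translate(p, offset) if is_obj else p
        for p, is_obj in zip(points, object_mask)
    ]
    # Adapt the actions: apply the same offset to every end-effector target.
    new_actions = [translate(a, offset) for a in skill_actions]
    return new_points, new_actions


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    mask = [True, False]  # first point belongs to the object
    actions = [(0.0, 0.0, 0.1)]
    print(augment_translation(cloud, mask, actions, (0.5, 0.25, 0.0)))
```

Applying the same offset to both the object points and the actions keeps the synthetic observation consistent with the synthetic trajectory, which is what lets a single source demonstration stand in for many object placements.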

Paper Structure

This paper contains 37 sections, 8 equations, 20 figures, 6 tables.

Figures (20)

  • Figure 1: Qualitative visualization of the spatial effective range. The grid maps display discretized tabletop workspaces from a bird's-eye view under different demonstration configurations. Dark green spots mark the locations where buttons are placed during the demonstrations. Each grid cell corresponds to a policy rollout with the button placed at that location. Blue, yellow, green, and gray grids denote successful executions for the Button-Large, Button-Small, both tasks, and no tasks, respectively.
  • Figure 2: Quantitative benchmarking on the spatial generalization capacity. We report the relationship between the agent's performance in success rates and the number of demonstrations used for training when different visuomotor policies and object randomization ranges are adopted. The results are averaged over $3$ seeds.
  • Figure 3: Pre-processing the source demonstration. The raw point cloud observations are processed by cropping, clustering, and down-sampling. The source action trajectory is parsed into motion and skill segments by referring to the semantic masks of manipulated objects.
  • Figure 4: Illustrations for action generation. (Left) Actions in the motion stage are planned to connect the neighboring skill segments. (Right) Actions in the skill stage undergo a uniform transformation.
  • Figure 5: Illustrations for synthetic visual observation generation. Objects in the to-do stage are segmented and transformed by the target object configurations. Objects in the doing stage are merged with the end-effector and transformed according to the proprioceptive states.
  • ...and 15 more figures
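
The pre-processing named in the Figure 3 caption (cropping and down-sampling of the raw point cloud) can be sketched as below. This is a hedged illustration only: the caption does not say which cropping bounds or down-sampling method are used, so the axis-aligned box and the fixed-stride selection here are stand-ins, the clustering step is omitted, and the function names `crop` and `downsample` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def crop(points: Sequence[Point], lower: Point, upper: Point) -> List[Point]:
    """Keep only points inside an axis-aligned box (bounds inclusive)."""
    return [
        p for p in points
        if all(lo <= c <= hi for c, lo, hi in zip(p, lower, upper))
    ]


def downsample(points: Sequence[Point], target: int) -> List[Point]:
    """Reduce a point list to at most `target` points with an even stride.

    Assumption: stride-based selection is a simple stand-in; the source
    does not specify the down-sampling method.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    if len(points) <= target:
        return list(points)
    step = len(points) / target
    # int(i * step) is always < len(points) because i < target.
    return [points[int(i * step)] for i in range(target)]


if __name__ == "__main__":
    raw = [(float(i), 0.0, 0.0) for i in range(10)]
    workspace = crop(raw, (2.0, -1.0, -1.0), (7.0, 1.0, 1.0))
    print(downsample(workspace, 3))
```

Cropping first discards background points outside the workspace, so the fixed point budget after down-sampling is spent on the region the policy actually needs to see.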