One-Shot Real-World Demonstration Synthesis for Scalable Bimanual Manipulation

Huayi Zhou, Kui Jia

TL;DR

BiDemoSyn tackles the data bottleneck of real-world bimanual manipulation by synthesizing thousands of feasible demonstrations from a single exemplar without simulation. It decomposes tasks into invariant and adaptable primitives, aligns to novel scenes through vision-based frame alignment, and optimizes dual-arm trajectories under physical constraints to produce diverse, ground-truth demonstrations. The approach yields strong policy generalization, scales data collection by orders of magnitude, and closes the gap between data efficiency and real-world fidelity. This enables practical imitation learning for complex bimanual tasks without resorting to extensive teleoperation or imperfect simulation.

Abstract

Learning dexterous bimanual manipulation policies critically depends on large-scale, high-quality demonstrations, yet current paradigms face inherent trade-offs: teleoperation provides physically grounded data but is prohibitively labor-intensive, while simulation-based synthesis scales efficiently but suffers from sim-to-real gaps. We present BiDemoSyn, a framework that synthesizes contact-rich, physically feasible bimanual demonstrations from a single real-world example. The key idea is to decompose tasks into invariant coordination blocks and variable, object-dependent adjustments, then adapt them through vision-guided alignment and lightweight trajectory optimization. This enables the generation of thousands of diverse and feasible demonstrations within several hours, without repeated teleoperation or reliance on imperfect simulation. Across six dual-arm tasks, we show that policies trained on BiDemoSyn data generalize robustly to novel object poses and shapes, significantly outperforming recent baselines. By bridging the gap between efficiency and real-world fidelity, BiDemoSyn provides a scalable path toward practical imitation learning for complex bimanual manipulation without compromising physical grounding.
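
One way to picture the alignment idea in the abstract is as a rigid re-targeting of a demonstrated grasp: if the object moves, the grasp moves with it. The snippet below is a minimal planar sketch under stated assumptions, not the authors' implementation. It assumes a tabletop pose given as (x, y, yaw), and the helper names `se2`, `inverse`, `retarget_grasp`, and `to_pose` are hypothetical. The actual pipeline also handles shape changes and trajectory optimization, which this sketch omits.

```python
import math

def se2(x, y, yaw):
    """Build a 3x3 homogeneous transform for a planar pose (yaw in radians)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def inverse(t):
    """Invert a rigid planar transform using R^T and -R^T p."""
    c, s, x, y = t[0][0], t[1][0], t[0][2], t[1][2]
    return [[c, s, -(c * x + s * y)],
            [-s, c, s * x - c * y],
            [0.0, 0.0, 1.0]]

def retarget_grasp(demo_obj, new_obj, demo_grasp):
    """Move the demonstrated grasp by the same rigid motion as the object.

    All arguments are world-frame transforms. The grasp pose relative to
    the object is kept fixed (hypothetical simplification).
    """
    delta = matmul(new_obj, inverse(demo_obj))  # object motion in world frame
    return matmul(delta, demo_grasp)

def to_pose(t):
    """Recover (x, y, yaw) from a planar homogeneous transform."""
    return (t[0][2], t[1][2], math.atan2(t[1][0], t[0][0]))

# Illustrative values only: the object is moved and rotated by 90 degrees.
demo_obj = se2(0.5, 0.0, 0.0)
demo_grasp = se2(0.5, 0.1, 0.0)   # grasp offset 0.1 along the object's y axis
new_obj = se2(0.3, 0.2, math.pi / 2)
new_grasp = to_pose(retarget_grasp(demo_obj, new_obj, demo_grasp))
```

Keeping the object-relative grasp fixed is the simplest possible choice; it only illustrates why a single exemplar can be reused across new placements once the object's pose in the new scene is estimated.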

Paper Structure

This paper contains 35 sections, 10 equations, 17 figures, 2 tables.

Figures (17)

  • Figure 1: From One to Many ($1 \rightarrow N$). Taking a dual-arm coordinated pouring task as an example, we illustrate how to synthesize the corresponding pre-grasping and lifting trajectories for new placements and novel instances of the manipulated objects during the initial frame alignment phase.
  • Figure 2: Overview of BiDemoSyn. Starting from a given demonstration, it consists of three stages: deconstruction, alignment, and optimization. The method can then be applied to collect data efficiently and conveniently in the real world. It is best to zoom in to view the details.
  • Figure 3: Illustrations of the initial frame alignment stage applied to the pouring (left and middle) and reorient (right) tasks. The grasp pose is adjusted automatically after the position, orientation, and shape of the manipulated object change.
  • Figure 4: (A) Comparison of data collection efficiency between different baselines and our BiDemoSyn, and (B) illustration of the quality of data generated by DemoGen. Although DemoGen has the highest synthesis efficiency, it cannot avoid visual artifacts caused by perspective transformation, so its data quality is the lowest. Our method balances speed and quality.
  • Figure 5: Success rate versus training data scale. Smaller subsets are randomly sampled from the full dataset at the task level. DP3 is chosen as the visuomotor policy.
  • ...and 12 more figures