HybridGen: VLM-Guided Hybrid Planning for Scalable Data Generation of Imitation Learning
Wensheng Wang, Ning Tan
TL;DR
HybridGen tackles the data bottleneck in robotic imitation learning by fusing Vision-Language Model guidance with a two-stage hybrid planning framework to generate large, diverse, format-independent demonstration datasets from a small set of human videos. It first uses VLM-based parsing to split tasks into expert-dependent and plannable segments and then expands these through pose transformations; a second augmentation stage enables free subtask selection via a Nearest Grasp Object Relative to Target Object strategy, preserving relative object poses across scenes. The approach is instantiated with CLIP-based dense keypoint constraints, Gemini-generated task constraints, and a constrained path planner that optimizes a multi-term objective including semantic, collision, smoothness, and IK terms, ensuring kinematic feasibility. Empirically, HybridGen yields a 5% average improvement over MimicGen across seven tasks and variants, with notable gains on difficult variants (59.7% vs 49.5%), and demonstrates robustness across BC-RNN, BC-Transformer, and Diffusion Policy, underscoring its broad applicability for scalable imitation learning in robotic manipulation.
Abstract
The acquisition of large-scale and diverse demonstration data is essential for improving the generalization of robotic imitation learning. However, generating such data for complex manipulation tasks is challenging in real-world settings. We introduce HybridGen, an automated framework that integrates a Vision-Language Model (VLM) with hybrid planning. HybridGen uses a two-stage pipeline: first, a VLM parses expert demonstrations, decomposing tasks into expert-dependent segments (handled by object-centric pose transformations for precise control) and plannable segments (for which diverse trajectories are synthesized via path planning); second, pose transformations substantially expand the first-stage data. Crucially, HybridGen generates a large volume of training data without requiring a specific data format, making it applicable to a wide range of imitation learning algorithms, a property we also demonstrate empirically across multiple algorithms. Evaluations across seven tasks and their variants show that agents trained with HybridGen achieve substantial performance and generalization gains, averaging a 5% improvement over state-of-the-art methods. Notably, on the most challenging task variants, HybridGen reaches a 59.7% average success rate, significantly outperforming MimicGen's 49.5%. These results demonstrate its effectiveness and practicality.
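To make the idea of an object-centric pose transformation concrete, here is a minimal sketch of the general technique, not HybridGen's actual implementation. It assumes that end-effector and object poses are 4x4 homogeneous transforms in a common world frame; the helper names `make_pose` and `transform_segment` are hypothetical and chosen only for illustration. A recorded segment is re-targeted to a new scene by keeping each end-effector pose fixed relative to the object.

```python
import numpy as np


def make_pose(yaw_rad, translation):
    """Build a 4x4 homogeneous transform from a yaw angle (about z) and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    pose = np.eye(4)
    pose[:3, :3] = np.array([[c, -s, 0.0],
                             [s, c, 0.0],
                             [0.0, 0.0, 1.0]])
    pose[:3, 3] = translation
    return pose


def transform_segment(ee_poses_src, obj_pose_src, obj_pose_new):
    """Re-target end-effector poses so they keep the same pose relative to the object.

    For each source pose: T_ee_new = T_obj_new @ inv(T_obj_src) @ T_ee_src.
    All inputs are assumed to be expressed in the same world frame.
    """
    delta = obj_pose_new @ np.linalg.inv(obj_pose_src)
    return np.array([delta @ pose for pose in ee_poses_src])


# Hypothetical example: the object is moved and rotated by 90 degrees in the new scene.
obj_src = make_pose(0.0, [0.5, 0.0, 0.1])
obj_new = make_pose(np.pi / 2, [0.2, 0.3, 0.1])
ee_src = np.array([
    make_pose(0.1, [0.5, 0.0, 0.30]),  # approach pose above the object
    make_pose(0.2, [0.5, 0.0, 0.15]),  # pose closer to the object
])
ee_new = transform_segment(ee_src, obj_src, obj_new)
```

Under these assumptions, the end-effector pose expressed in the object's frame is identical before and after the transformation, which is what lets a contact-rich, expert-dependent segment be replayed in a scene where the object has moved.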
