HybridGen: VLM-Guided Hybrid Planning for Scalable Data Generation of Imitation Learning

Wensheng Wang, Ning Tan

TL;DR

HybridGen tackles the data bottleneck in robotic imitation learning by fusing Vision-Language Model guidance with a two-stage hybrid planning framework to generate large, diverse, format-independent demonstration datasets from a small set of human videos. It first uses VLM-based parsing to split tasks into expert-dependent and plannable segments and then expands these through pose transformations; a second augmentation stage enables free subtask selection via a Nearest Grasp Object Relative to Target Object strategy, preserving relative object poses across scenes. The approach is instantiated with CLIP-based dense keypoint constraints, Gemini-generated task constraints, and a constrained path planner that optimizes a multi-term objective including semantic, collision, smoothness, and IK terms, ensuring kinematic feasibility. Empirically, HybridGen yields a 5% average improvement over MimicGen across seven tasks and variants, with notable gains on difficult variants (59.7% vs 49.5%), and demonstrates robustness across BC-RNN, BC-Transformer, and Diffusion Policy, underscoring its broad applicability for scalable imitation learning in robotic manipulation.
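As a rough illustration of the multi-term planning objective mentioned above, the sketch below scores a candidate path as a weighted sum of semantic, collision, smoothness, and IK terms. The weights, the callback names (`semantic_cost`, `collision_cost`, `ik_cost`), and the choice of squared second differences for smoothness are assumptions made for illustration; the summary does not specify how the paper defines or weights each term.

```python
import numpy as np

def path_cost(waypoints, semantic_cost, collision_cost, ik_cost,
              weights=(1.0, 1.0, 0.1, 1.0)):
    """Weighted sum of four per-path cost terms (hypothetical formulation).

    waypoints: (N, 3) sequence of end-effector positions.
    semantic_cost, collision_cost, ik_cost: callables mapping one waypoint
        to a non-negative float (placeholders for the real constraint checks).
    weights: (w_semantic, w_collision, w_smoothness, w_ik), assumed values.
    """
    w_sem, w_col, w_smooth, w_ik = weights
    pts = np.asarray(waypoints, dtype=float)

    # Smoothness: sum of squared second differences (discrete acceleration).
    accel = np.diff(pts, n=2, axis=0)
    smooth = float(np.sum(accel ** 2))

    # Remaining terms are accumulated per waypoint.
    sem = sum(float(semantic_cost(p)) for p in pts)
    col = sum(float(collision_cost(p)) for p in pts)
    ik = sum(float(ik_cost(p)) for p in pts)

    return w_sem * sem + w_col * col + w_smooth * smooth + w_ik * ik
```

A planner would minimize this value over candidate waypoint sequences. A straight, evenly spaced path has zero smoothness cost under this definition, so the other terms decide between such paths.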

Abstract

The acquisition of large-scale and diverse demonstration data is essential for improving the generalization of robotic imitation learning. However, generating such data for complex manipulations is challenging in real-world settings. We introduce HybridGen, an automated framework that integrates a Vision-Language Model (VLM) with hybrid planning. HybridGen uses a two-stage pipeline: first, a VLM parses expert demonstrations, decomposing tasks into expert-dependent segments (handled by object-centric pose transformations for precise control) and plannable segments (for which diverse trajectories are synthesized via path planning); second, pose transformations substantially expand the first-stage data. Crucially, HybridGen generates a large volume of training data without requiring specific data formats, making it applicable to a wide range of imitation learning algorithms, which we also demonstrate empirically across multiple algorithms. Evaluations across seven tasks and their variants show that agents trained with HybridGen achieve substantial performance and generalization gains, averaging a 5% improvement over state-of-the-art methods. Notably, on the most challenging task variants, HybridGen reaches a 59.7% average success rate, outperforming MimicGen's 49.5%. These results demonstrate its effectiveness and practicality.
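To make the idea of an object-centric pose transformation concrete, here is a minimal sketch that re-targets an expert segment to a new object pose while keeping the end-effector pose relative to the object unchanged. It assumes poses are 4x4 homogeneous matrices in a common world frame; the helper names `make_pose` and `transform_segment` are hypothetical, and this is a common way to implement such a transformation rather than a confirmed detail of HybridGen.

```python
import numpy as np

def make_pose(yaw_rad, translation):
    """Build a 4x4 homogeneous pose from a rotation about z and a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    pose = np.eye(4)
    pose[:3, :3] = np.array([[c, -s, 0.0],
                             [s,  c, 0.0],
                             [0.0, 0.0, 1.0]])
    pose[:3, 3] = translation
    return pose

def transform_segment(src_eef_poses, src_obj_pose, new_obj_pose):
    """Map end-effector poses recorded near one object pose to a new object pose.

    Each output pose satisfies
        inv(new_obj_pose) @ new_eef == inv(src_obj_pose) @ src_eef,
    i.e. the end-effector pose expressed in the object frame is preserved.
    """
    delta = new_obj_pose @ np.linalg.inv(src_obj_pose)
    return [delta @ eef for eef in src_eef_poses]
```

For example, if the object is moved and rotated in a new scene, passing the recorded segment together with the old and new object poses yields a segment that approaches the object from the same relative direction.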

Paper Structure

This paper contains 17 sections, 10 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: HybridGen Overview. We propose an automated data generation framework leveraging VLM and hybrid planning strategies. (a) The first stage involves the initial augmentation of limited human data using pose adaptation and VLM replanning. (b) The second stage performs further augmentation through pose-only adaptation, resulting in a larger training dataset. (c) The data generated by this framework can be seamlessly used to train various imitation learning algorithms.
  • Figure 2: Pipeline. For each task, HybridGen takes the following inputs: a textual task description, a video recording of a human demonstration, and an initial RGB image of the environment. The Video Decomposer parses the task, dividing the complete trajectory into Plannable and Expert-Dependent segments. The Keypoint Extractor extracts task-relevant keypoints. The Constraint Miner module takes the environment's RGB image (annotated with keypoints) and the task description, and generates the constraints required to accomplish the task. The Path Planner first transforms the expert demonstrations to fit the current environment, then replans the Plannable segments according to the derived constraints. Through this process, HybridGen integrates information from expert demonstrations with the prior knowledge of a VLM, significantly enhancing the diversity of the generated demonstrations.
  • Figure 3: Keypoint Extraction Process. (a) Heatmap generated by CLIP highlighting regions semantically relevant to the task description. Brighter areas indicate higher relevance. (b) Keypoints extracted from the heatmap, representing task-relevant spatial locations. Keypoint 0 is always the end-effector.
  • Figure 4: Tasks. We evaluate the HybridGen framework on seven manipulation tasks and their variants. Each task involves interaction between a grasped object and a target object. Tasks (a-c) require fine-grained manipulation and precise control. Tasks (d-g) are multi-stage tasks that test the agent's ability to plan over long horizons.
  • Figure 5: Effectiveness of HybridGen across algorithms. Success-rate comparison (3 seeds) between HybridGen and MimicGen on Square task variants ($D_0$, $D_1$, $D_2$) using BC-Transformer and Diffusion Policy architectures. HybridGen shows a consistent performance advantage across fundamentally different policy learning paradigms, supporting its ability to generate demonstration data that is useful for diverse imitation learning approaches.
  • ...and 1 more figures
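
The keypoint extraction step described in the Figure 3 caption (a relevance heatmap reduced to a handful of spatial locations) can be sketched as greedy peak picking with local suppression. This is an assumed, simplified procedure: the function name `extract_keypoints`, the square suppression window, and the parameters `num_keypoints` and `min_distance` are illustrative, and the summary does not state how HybridGen actually selects keypoints from the CLIP heatmap.

```python
import numpy as np

def extract_keypoints(heatmap, num_keypoints=5, min_distance=10):
    """Pick up to num_keypoints peaks from a 2D relevance heatmap.

    Repeatedly takes the brightest remaining pixel, then masks a square
    window of half-width min_distance around it so that the next pick
    lands on a different region. Returns a list of (row, col) tuples.
    """
    heat = np.array(heatmap, dtype=float)  # copy so the input is not modified
    keypoints = []
    for _ in range(num_keypoints):
        row, col = np.unravel_index(np.argmax(heat), heat.shape)
        if not np.isfinite(heat[row, col]):
            break  # the whole map has been suppressed
        keypoints.append((int(row), int(col)))
        r0, r1 = max(0, row - min_distance), min(heat.shape[0], row + min_distance + 1)
        c0, c1 = max(0, col - min_distance), min(heat.shape[1], col + min_distance + 1)
        heat[r0:r1, c0:c1] = -np.inf
    return keypoints
```

In a full pipeline the end-effector position would be added separately as keypoint 0, as the caption notes, with the heatmap peaks numbered after it.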