EasyInsert: A Data-Efficient and Generalizable Insertion Policy

Guanghe Li, Junming Zhao, Shengjie Wang, Yang Gao

TL;DR

EasyInsert reframes robotic insertion as delta-pose prediction between plug and socket, enabling a data-efficient, generalizable policy trained from real-world data without CAD models or sim-to-real transfer. A diffusion-policy delta-pose predictor powers a coarse-to-fine execution strategy that handles unseen objects, clutter, and large initial pose deviations. With just 5 hours of pretraining data, the approach achieves over 90% zero-shot success on 13 of 15 unseen objects, and with a single human demonstration plus 4 minutes of automatically collected fine-tuning data it exceeds 90% on all 15, which suggests practical potential for industrial settings. The method reduces data collection costs while generalizing across objects, spatial configurations, and environments, laying a foundation for general-purpose robotic assembly.

Abstract

Insertion is a highly challenging task that requires robots to operate with exceptional precision in cluttered environments. Existing methods often generalize poorly. They typically function only in restricted, structured environments, and frequently fail when the plug and socket are far apart, when the scene is densely cluttered, or when handling novel objects. They also rely on strong assumptions such as access to CAD models or a digital twin in simulation. To address this, we propose EasyInsert, a framework that leverages the human intuition that the relative pose (delta pose) between plug and socket is sufficient for successful insertion, and that employs efficient, automated real-world data collection with minimal human labor to train a generalizable model for relative pose prediction. During execution, EasyInsert follows a coarse-to-fine procedure based on the predicted delta pose and successfully performs various insertion tasks. EasyInsert demonstrates strong zero-shot generalization to unseen objects in cluttered environments, handling cases with significant initial pose deviations while maintaining high sample efficiency and requiring little human effort. In real-world experiments, with just 5 hours of training data, EasyInsert achieves over 90% success in zero-shot insertion for 13 out of 15 unseen objects, including challenging objects like Type-C cables, HDMI cables, and Ethernet cables. Furthermore, with only one human demonstration and 4 minutes of automatically collected data for fine-tuning, it reaches over 90% success rate for all 15 objects.
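
To make the coarse-to-fine idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a position-only delta pose (the real task also involves orientation) and replaces the learned vision-based predictor with a hypothetical mock, `mock_predict_delta`, whose error is assumed to shrink as the plug nears the socket. The function names and the thresholds `coarse_radius`, `fine_step`, and `tol` are illustrative choices, not values from the paper.

```python
import numpy as np


def mock_predict_delta(plug_pos, socket_pos, rng):
    """Hypothetical stand-in for a learned delta-pose predictor.

    Returns the true plug-to-socket offset plus noise proportional to the
    distance (an assumption made only for this illustration).
    """
    true_delta = socket_pos - plug_pos
    noise_scale = 0.05 * np.linalg.norm(true_delta)
    return true_delta + rng.normal(0.0, noise_scale, size=3)


def coarse_to_fine_insert(plug_pos, socket_pos, rng,
                          coarse_radius=0.05, fine_step=0.005,
                          tol=0.001, max_steps=200):
    """Move the plug toward the socket using repeated delta predictions.

    Far from the socket, the whole predicted delta is executed (coarse).
    Close to the socket, each motion is capped at `fine_step` (fine).
    All distances are in meters and all thresholds are illustrative.
    """
    plug_pos = np.asarray(plug_pos, dtype=float).copy()
    socket_pos = np.asarray(socket_pos, dtype=float)
    for step in range(max_steps):
        delta = mock_predict_delta(plug_pos, socket_pos, rng)
        dist = np.linalg.norm(delta)
        if dist < tol:
            # Predicted offset is within tolerance: treat as aligned.
            return plug_pos, step, True
        if dist > coarse_radius:
            # Coarse phase: execute the full predicted motion.
            plug_pos += delta
        else:
            # Fine phase: re-predict after every small, capped step.
            plug_pos += delta * min(1.0, fine_step / dist)
    return plug_pos, max_steps, False


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    final, steps, ok = coarse_to_fine_insert([0.3, -0.2, 0.25],
                                             [0.0, 0.0, 0.0], rng)
    print(f"aligned={ok} after {steps} steps, "
          f"residual={np.linalg.norm(final):.5f} m")
```

Re-predicting the delta after every fine step is what lets a closed-loop scheme like this tolerate an imprecise first estimate: the initial coarse motion only needs to bring the plug close enough for later predictions to become accurate.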

Paper Structure

This paper contains 23 sections, 3 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Overview of our framework: (1) Left: Zero-shot evaluation of generalization ability: Along the horizontal axis, the model generalizes to increasingly complex environments starting from a distant initial position. Along the vertical axis, it generalizes to diverse, previously unseen objects. (2) Middle: Generalization performance evaluated under 0-shot and 1-shot settings. (3) Right: Our method's data collection module gathers 5 hours of pretraining data, with only 20% manually collected. Trained on 5 object categories, the model generalizes to 15 unseen test objects.
  • Figure 2: Overview of our method: (1) Left: The data collection module constructs the training dataset from 80% automated and 20% manual collection; manual collection focuses on fine-grained interactions around the socket area, while automated collection scales up data over a larger spatial range. (2) Middle: A generalist policy, pretrained on the collected data, predicts the relative pose between plug and socket directly from visual inputs. For tasks requiring higher precision, the same data collection module can be reused to perform one-shot fine-tuning on the target objects. (3) Right: Motivated by human insertion behavior, we design a similar coarse-to-fine execution process for the robot.
  • Figure 3: Coarse-to-fine hierarchical insertion procedure.
  • Figure 4: Left: We randomly place distractor objects around the socket. Right: In perturbation experiments, we randomly move the socket position.
  • Figure 5: Ablation study on data augmentation and manually collected data.
  • ...and 7 more figures