Zero-Shot Peg Insertion: Identifying Mating Holes and Estimating SE(2) Poses with Vision-Language Models

Masaru Yajima, Kei Ota, Asako Kanezaki, Rei Kawakami

TL;DR

This work tackles zero-shot peg insertion into unseen holes by leveraging Vision-Language Models to jointly identify the correct mating hole and estimate its SE(2) pose without task-specific training. The approach integrates multi-view VLM-based hole matching, yaw-angle-aware pose estimation, candidate-hole detection, and a spiral insertion strategy, reinforced by a confidence-based ranking and a closed-loop refinement mechanism. Empirical results show strong generalization across 3D-printed parts, toy puzzles, and industrial connectors, achieving 90.2% hole-identification accuracy and 88.3% end-to-end insertion success on a real PC back panel. The findings demonstrate the potential of VLM-driven zero-shot reasoning to enable robust, adaptable robotic assembly in high-mix, low-volume settings, while outlining avenues for enhancement with tactile feedback and tighter pose-search loops.
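The summary above names a spiral insertion strategy but gives no parameters for it. As a rough illustration only, the sketch below generates search offsets along a generic Archimedean spiral around the estimated hole position. The function name `spiral_offsets`, the pitch, the step size, and the maximum radius are all assumptions made for this example, not values from the paper.

```python
import math


def spiral_offsets(pitch_mm=0.5, step_mm=0.5, max_radius_mm=3.0):
    """Return (dx, dy) offsets along an Archimedean spiral r = pitch * theta / (2*pi).

    All parameters are hypothetical defaults chosen for illustration.
    """
    points = [(0.0, 0.0)]  # start at the estimated hole position
    theta = 0.0
    while True:
        r = pitch_mm * theta / (2.0 * math.pi)
        # Advance the angle so consecutive waypoints are roughly step_mm apart;
        # the max() avoids a division by a near-zero radius close to the centre.
        theta += step_mm / max(r, step_mm)
        r = pitch_mm * theta / (2.0 * math.pi)
        if r > max_radius_mm:
            break
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


if __name__ == "__main__":
    for dx, dy in spiral_offsets()[:5]:
        print(f"offset: ({dx:+.3f}, {dy:+.3f}) mm")
```

A controller would typically visit these offsets in order while pressing lightly along the insertion axis, stopping once the peg drops into the hole; how the authors detect that event is not stated here.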

Abstract

Achieving zero-shot peg insertion, that is, inserting an arbitrary peg into an unseen hole without task-specific training, remains a fundamental challenge in robotics. This task demands a highly generalizable perception system capable of detecting potential holes, selecting the correct mating hole from multiple candidates, estimating its precise pose, and executing insertion despite uncertainties. While learning-based methods have been applied to peg insertion, they often fail to generalize beyond the specific peg-hole pairs encountered during training. Recent advancements in Vision-Language Models (VLMs) offer a promising alternative, leveraging large-scale datasets to enable robust generalization across diverse tasks. Inspired by their success, we introduce a novel zero-shot peg insertion framework that utilizes a VLM to identify mating holes and estimate their poses without prior knowledge of their geometry. Extensive experiments demonstrate that our method achieves 90.2% accuracy in identifying the correct mating hole, significantly outperforming baselines across a wide range of previously unseen peg-hole pairs, including 3D-printed objects, toy puzzles, and industrial connectors. Furthermore, we validate the effectiveness of our approach in a real-world connector insertion task on the back panel of a PC, where our system successfully detects holes, identifies the correct mating hole, estimates its pose, and completes the insertion with a success rate of 88.3%. These results highlight the potential of VLM-driven zero-shot reasoning for enabling robust and generalizable robotic assembly.

Paper Structure

This paper contains 13 sections, 1 equation, 8 figures, 2 tables.

Figures (8)

  • Figure 1: This work tackles the challenge of inserting an arbitrary peg into a previously unseen hole without prior knowledge of its type and geometry. We propose a novel framework that leverages a VLM to identify the correct mating hole (top right) and estimate its SE(2) pose (bottom right) in a zero-shot manner. Our method enables robust generalization across diverse peg-hole pairs, outperforming conventional approaches.
  • Figure 2: We provide the VLM with the peg image(s) and the candidate hole image(s) along with a prompt. The VLM determines whether the given peg and hole constitute a valid match by outputting either Yes or No, accompanied by the corresponding generation probability $p(o_m)$. This process is repeated for all candidate holes and the generated probabilities are used to rank the most suitable candidate, enabling the identification of the most compatible hole.
  • Figure 3: Yaw angle estimation process: (a) input RGB image $I^h_\text{all}$; (b) minimum bounding rectangle obtained from the segmentation mask; (c) image rotated to align the rectangle with the camera axis; (d) images rotated by 0$^{\circ}$, 90$^{\circ}$, 180$^{\circ}$, and 270$^{\circ}$ from the image generated in step (c), which are input to the VLM (see Fig. \ref{fig:pipeline}).
  • Figure 4: Diverse peg-hole pairs used for our experiments.
  • Figure 5: Responses (Yes or No, shown as Y or N) and associated probabilities for each peg-hole pair, generated by our method for industrial connectors. The three shades of green mark the top-1, top-2, and top-3 candidates. The table shows that top-3 accuracy is $100\%$, so the robot can insert the peg into the mating hole within $3$ trials when multiple attempts are allowed.
  • ...and 3 more figures
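
The candidate ranking described in the captions of Figures 2 and 5, and the four-rotation yaw selection of Figure 3, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `MatchResponse` type, the function names, and the sample numbers are hypothetical. Two conventions here are also assumptions, since the captions do not spell them out: a No answer with probability p is scored as 1 - p, and the final yaw is taken as the bounding-rectangle angle plus the best-scoring rotation offset.

```python
from dataclasses import dataclass


@dataclass
class MatchResponse:
    """One VLM reply for a peg/hole pair (hypothetical container)."""
    hole_id: str
    answer: str         # "Yes" or "No"
    probability: float  # generation probability of the emitted answer


def match_score(resp):
    # Assumption: a "No" emitted with probability p counts as a match score of 1 - p.
    return resp.probability if resp.answer == "Yes" else 1.0 - resp.probability


def rank_holes(responses, top_k=3):
    """Return the ids of the top_k candidate holes, best match first."""
    ranked = sorted(responses, key=match_score, reverse=True)
    return [r.hole_id for r in ranked[:top_k]]


def select_yaw(base_angle_deg, rotation_scores):
    """Pick the yaw from rotation offsets (0, 90, 180, 270) mapped to match scores.

    base_angle_deg is the angle that aligns the minimum bounding rectangle
    with the camera axis; the offset with the highest score is added to it.
    """
    best_offset = max(rotation_scores, key=rotation_scores.get)
    return (base_angle_deg + best_offset) % 360.0


if __name__ == "__main__":
    # Made-up replies, for illustration only.
    replies = [
        MatchResponse("usb", "Yes", 0.62),
        MatchResponse("hdmi", "No", 0.90),
        MatchResponse("dp", "Yes", 0.95),
        MatchResponse("lan", "No", 0.55),
    ]
    print(rank_holes(replies))
    print(select_yaw(12.5, {0: 0.2, 90: 0.1, 180: 0.9, 270: 0.3}))
```

Returning the top three candidates rather than only the best one mirrors the multi-trial setting of Figure 5, where the robot may attempt the next-ranked hole if an insertion fails.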