SegGrasp: Zero-Shot Task-Oriented Grasping via Semantic and Geometric Guided Segmentation

Haosheng Li, Weixin Mao, Weipeng Deng, Chenyu Meng, Rui Zhang, Fan Jia, Tiancai Wang, Haoqiang Fan, Hongan Wang, Xiaoming Deng

TL;DR

SegGrasp is a training-free framework that incorporates both semantic and geometric priors for zero-shot task-oriented grasp generation. It first leverages a vision-language model such as GLIP for coarse segmentation, then uses detailed geometric information from convex decomposition to improve segmentation quality through a fusion policy named GeoFusion.

Abstract

Task-oriented grasping, which involves grasping specific parts of objects based on their functions, is crucial for developing advanced robotic systems capable of performing complex tasks in dynamic environments. In this paper, we propose a training-free framework that incorporates both semantic and geometric priors for zero-shot task-oriented grasp generation. The proposed framework, SegGrasp, first leverages vision-language models such as GLIP for coarse segmentation. It then uses detailed geometric information from convex decomposition to improve segmentation quality through a fusion policy named GeoFusion. With the improved segmentation, a grasping network can generate an effective grasp pose. We conducted experiments on both a segmentation benchmark and real-world robot grasping. The experimental results show that SegGrasp surpasses the baseline by more than 15% in grasp and segmentation performance.

Paper Structure

This paper contains 19 sections, 9 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison of the baseline method and our method. The baseline method, which consists of SATR and Contact-GraspNet, often generates incorrect grasp poses due to mis-segmentation. In contrast, our approach produces cleaner and more precise segmentation without any unknown region, resulting in better grasp pose generation.
  • Figure 2: The overall architecture of SegGrasp. Given a target object, our method renders the mesh from random viewpoints. We then utilize a vision-language model, such as Grounding DINO, to detect bounding boxes in the image and create coarse segmentation. The mesh is decomposed into multiple parts using various decomposition thresholds, each resulting in a different segmentation. Using GeoFusion, these segmentations are fused with the initial coarse segmentation to achieve a refined segmentation. Finally, the refined segmentation facilitates Contact-GraspNet in generating high-quality grasp poses.
  • Figure 3: Details of GeoFusion. The upper section shows GeoFusion starting with coarse segmentation and convex decomposition under different decomposition thresholds. After multi-fusion ('Fusion') and fine-grained optimization ('Opt'), the unknown parts of the knife decrease, and the segmentation results improve. The color of each square indicates the category score of the faces, corresponding to the value in the matrix $S$. Fine-grained optimization can effectively enforce that faces within the same segment belong to the same object part.
  • Figure 4: Overview of our SegGraspNet dataset which includes 9 common categories from everyday life.
  • Figure 5: The details of our robot experiments. (a) shows one of our robot grasping setups, consisting of a UR5 arm, RealSense D456 camera, and Robotiq gripper. (b) demonstrates grasp results from our method on the SegGraspSet dataset.
  • ...and 1 more figure
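
The fusion idea described in the captions for Figures 2 and 3 can be illustrated with a small sketch. This is not the authors' GeoFusion implementation, whose exact update rule is not given here. It assumes a per-face category score matrix (playing the role of $S$), non-negative scores, and one array of segment ids per decomposition threshold. The function name `fuse_segmentations`, the `UNKNOWN` label, and the mean-pooling rule are all hypothetical choices made for illustration.

```python
import numpy as np

UNKNOWN = -1  # hypothetical label for faces with no category evidence


def fuse_segmentations(coarse_scores, decompositions):
    """Spread coarse per-face scores over geometric segments.

    coarse_scores:  (F, K) array, score of each of F faces for K categories
                    (all zeros for a face the coarse stage left unknown).
    decompositions: list of length-F integer arrays; each gives a segment id
                    per face for one decomposition threshold.
    Returns the fused (F, K) scores and an (F,) array of labels.
    """
    coarse_scores = np.asarray(coarse_scores, dtype=float)
    fused = np.zeros_like(coarse_scores)
    for segment_ids in decompositions:
        segment_ids = np.asarray(segment_ids)
        for segment in np.unique(segment_ids):
            mask = segment_ids == segment
            # Every face in a segment receives the segment's mean score,
            # which pushes faces of one convex part toward one label.
            fused[mask] += coarse_scores[mask].mean(axis=0)
    fused /= len(decompositions)
    labels = fused.argmax(axis=1)
    # Faces whose segments carried no evidence at any threshold stay unknown.
    labels[fused.sum(axis=1) == 0] = UNKNOWN
    return fused, labels


# Toy knife with 6 faces and 2 categories (0 = blade, 1 = handle).
# Faces 1 and 5 have no coarse score.
coarse = [[1, 0], [0, 0], [1, 0], [0, 1], [0, 1], [0, 0]]
decomps = [[0, 0, 0, 1, 1, 1],   # coarser decomposition threshold
           [0, 0, 1, 1, 2, 2]]   # finer decomposition threshold
scores, labels = fuse_segmentations(coarse, decomps)
print(labels)
```

In this toy input, faces 1 and 5 start without a score and inherit a label from the segments they share with labelled faces, which mirrors the shrinking unknown region described for Figure 3. A face stays unknown only if none of its segments contains any scored face.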