Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding

Yaoxian Song; Penglei Sun; Piaopiao Jin; Yi Ren; Yu Zheng; Zhixu Li; Xiaowen Chu; Yue Zhang; Tiefeng Li; Jason Gu

TL;DR

This work tackles fine-grained 6-DoF grasping at the object-part level by grounding 3D parts through natural language and leveraging large language models. It introduces LangSHAPE, a large-scale language-pointcloud-grasp dataset, and LangPartGPD, a two-stage framework consisting of a 3D part language grounding module and a part-aware grasp pose detection module. The approach enables language-guided sampling within semantically grounded regions, achieving improved part-specific and part-agnostic grasp success in both simulated and real-robot experiments, and demonstrates notable generalization to novel objects and parts. By integrating explicit language as a symbolic intermediate and utilizing LLM-driven reasoning, the method advances open-world, interpretable affordance reasoning for robotic manipulation.

Abstract

Robotic grasping is a fundamental ability for a robot to interact with the environment. Current methods focus on how to obtain a stable and reliable grasping pose at the object level, while little work has studied part (shape)-wise grasping, which is related to fine-grained grasping and robotic affordance. Parts can be seen as the atomic elements that compose an object; they contain rich semantic knowledge and correlate strongly with affordance. However, the lack of a large part-wise 3D robotic dataset limits the development of part representation learning and downstream applications. In this paper, we propose a new large Language-guided SHape grAsPing datasEt (named LangSHAPE) to promote the learning of 3D part-level affordance and grasping ability. From the perspective of robotic cognition, we design a two-stage fine-grained robotic grasping framework (named LangPartGPD), including a novel 3D part language grounding model and a part-aware grasp pose detection model, in which explicit language input from humans or large language models (LLMs) can guide a robot to generate a part-level 6-DoF grasping pose with a textual explanation. Our method combines the advantages of human-robot collaboration and LLMs' planning ability, using explicit language as a symbolic intermediate. To evaluate the effectiveness of our proposed method, we perform 3D part grounding and fine-grained grasp detection experiments in both simulation and physical robot settings, following language instructions of varying textual complexity. Results show our method achieves competitive performance in 3D geometry fine-grained grounding, object affordance inference, and 3D part-aware grasping tasks. Our dataset and code are available on our project website: https://sites.google.com/view/lang-shape
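
The two-stage idea in the abstract (ground a part first, then detect grasps on that part) can be illustrated with a toy sketch. This is not the LangPartGPD implementation: the `GraspCandidate` record, the boolean `part_mask`, and the `select_part_grasp` helper are hypothetical names introduced here, and a real system would use learned models and full 6-DoF poses rather than point indices.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class GraspCandidate:
    """Hypothetical grasp record: the point it targets and a quality score."""
    point_index: int
    score: float


def select_part_grasp(
    part_mask: Sequence[bool],
    candidates: List[GraspCandidate],
) -> Optional[GraspCandidate]:
    """Return the best-scoring candidate that lies on the grounded part.

    part_mask[i] is True when point i belongs to the part named in the
    instruction (assumed here to be the output of a grounding stage).
    """
    on_part = [g for g in candidates if part_mask[g.point_index]]
    if not on_part:
        return None  # no candidate touches the requested part
    return max(on_part, key=lambda g: g.score)


# Toy data: points 2 and 3 form the grounded part (say, a mug handle).
part_mask = [False, False, True, True, False]
candidates = [
    GraspCandidate(point_index=0, score=0.95),  # stable, but off the part
    GraspCandidate(point_index=2, score=0.60),
    GraspCandidate(point_index=3, score=0.80),
]
best = select_part_grasp(part_mask, candidates)
```

In this toy case the highest-scoring candidate overall is rejected because it does not lie on the requested part, which is the difference between part-agnostic and part-specific grasp selection.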

Paper Structure (25 sections, 3 equations, 8 figures, 7 tables)

Introduction
Related work
Language Grounding in Robotic Manipulation
Affordance in Robotic Grasping Detection
Problem Statement
Data Generation
Point Cloud Observation with Semantics
Grasp Sampling and Labeling
Part-affordance-based Language Generation
Fine-grained Grasping Method
3D Part Language Grounding
Part-aware Grasp Pose Detection
Training and Inference
Experiment
Data Organization
...and 10 more sections

Figures (8)

Figure 1: Illustration of 6-DoF grasp detection. Most existing work focuses on the graspability aspect of grasp pose detection via visual perception, while ignoring grasp semantics for grasping functionality and the explainability of decisions. Our method attempts to overcome these limitations by introducing an intermediate process (i.e., 3D part language grounding) using explicit natural language from humans or LLMs.
Figure 2: An example from LangSHAPE with a point cloud observation $\mathbb{C}$ under the first random placement, a labeled 6-DoF grasping pose $\mathbb{G}$, and natural language $\mathbb{Q}$ about the part, object, and affordance for grasping.
Figure 3: The pipeline of language generation in LangSHAPE. It includes three key steps. The first generates various sentence templates by prompt engineering. The second collects a corpus about objects, parts, and affordances from an open-world knowledge base. The third injects the corpus into the templates to generate sentences and polishes their expression and grammar.
Figure 4: The overall architecture of LangPartGPD. The black arrow trace refers to 3D part language grounding. The red arrow trace refers to part-aware grasp pose detection. Multiple point cloud observations of the object are collected, registered with ICP, and downsampled before being fed into LangPartGPD.
Figure 5: The designed prompts of LangPartGPD-ChatGPT/Flan-T5 for predicting the grasped part from a corrupted instruction.
...and 3 more figures
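
The template-injection step described in the Figure 3 caption can be sketched in a few lines. This is only an illustration under stated assumptions: the template string, the corpus entries, and the `fill_templates` helper are hypothetical, the prompt-engineering and polishing steps are omitted, and the exhaustive pairing of terms ignores whether a given part actually belongs to a given object.

```python
import itertools
from typing import Dict, List


def fill_templates(templates: List[str], corpus: Dict[str, List[str]]) -> List[str]:
    """Fill every template with every combination of corpus terms.

    Each template uses named slots such as {object}, {part}, {affordance};
    the corpus maps each slot name to its candidate words.
    """
    keys = sorted(corpus)  # fixed slot order keeps the output deterministic
    sentences = []
    for template in templates:
        for values in itertools.product(*(corpus[k] for k in keys)):
            sentences.append(template.format(**dict(zip(keys, values))))
    return sentences


# Hypothetical template and corpus, not taken from the dataset.
templates = ["Grasp the {part} of the {object} to {affordance}."]
corpus = {
    "object": ["mug"],
    "part": ["handle", "body"],
    "affordance": ["pour"],
}
sentences = fill_templates(templates, corpus)
for sentence in sentences:
    print(sentence)
```

A realistic pipeline would restrict combinations to valid object-part-affordance triples before filling the slots, and would then pass the sentences through a polishing step as the caption describes.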
