Grasp as You Say: Language-guided Dexterous Grasp Generation

Yi-Lin Wei; Jian-Jian Jiang; Chengyi Xing; Xian-Tuo Tan; Xiao-Ming Wu; Hao Li; Mark Cutkosky; Wei-Shi Zheng

TL;DR

DexGYSNet, a language-guided dexterous grasp dataset, is proposed, offering high-quality dexterous grasp annotations along with flexible and fine-grained human language guidance.

Abstract

This paper explores a novel task, "Dexterous Grasp as You Say" (DexGYS), which enables robots to perform dexterous grasping based on human commands expressed in natural language. The development of this field is hindered by the lack of datasets with natural human guidance; we therefore propose a language-guided dexterous grasp dataset, named DexGYSNet, offering high-quality dexterous grasp annotations along with flexible and fine-grained human language guidance. Our dataset construction is cost-efficient, relying on a carefully designed hand-object interaction retargeting strategy and an LLM-assisted language guidance annotation system. Equipped with this dataset, we introduce the DexGYSGrasp framework for generating dexterous grasps from human language instructions, capable of producing grasps that are intent-aligned, high-quality, and diverse. To achieve this capability, our framework decomposes the complex learning process into two manageable progressive objectives and introduces two components to realize them. The first component learns the grasp distribution, focusing on intention alignment and generation diversity. The second component refines the grasp quality while maintaining intention consistency. Extensive experiments are conducted on DexGYSNet and in real-world environments for validation.
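
As a rough sketch of the two progressive objectives, and assuming only what the captions of Figures 4 and 5 state (a regression loss for the first component, and regression plus a weighted penetration loss for the second), the training losses might be written as follows. The symbols $\mathcal{L}_{reg}$, $\mathcal{L}_{pen}$, $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$ are illustrative names, not notation confirmed by this page, and the full formulation may include further terms.

$$
\mathcal{L}_{1} = \mathcal{L}_{reg}, \qquad
\mathcal{L}_{2} = \mathcal{L}_{reg} + \lambda_{pen}\,\mathcal{L}_{pen}
$$

Under this reading, keeping $\mathcal{L}_{pen}$ out of the first objective is what preserves intention alignment and diversity, while the second objective trades a small amount of freedom for reduced object penetration.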

Paper Structure (41 sections, 16 equations, 15 figures, 4 tables)

Introduction
Related work
Dexterous Grasp Generation
Grasp Datasets
Language-guided Robot Grasp
DexGYSNet Dataset
Dataset Overview
Hand-Object Interaction Retargeting
LLM-assisted Language Guidance Annotation
DexGYSGrasp Framework
Progressive Grasp Objectives
Progressive Grasp Components
Progressive Grasp Loss
Experiments
Datasets and Evaluation Metrics
...and 26 more sections

Figures (15)

Figure 1: Our Language-guided Task vs. Traditional Dexterous Grasp Tasks. Traditional methods focus either solely on grasp quality or on fixed and limited functionalities. Our approach enables the generation of dexterous grasps based on human language, enhancing natural human-robot interactions.
Figure 2: Visualization of the impact of penetration loss (Pen. in the figure) on grasp performance: intention alignment, quality, and diversity. (a) illustrates that penetration loss causes intention misalignment, while its absence results in severe object penetration. (b) shows three sampling results under the same conditions and demonstrates that penetration loss leads to reduced diversity.
Figure 3: The construction process of the DexGYSNet dataset. (a) The HOIR strategy retargets the human hand to the dexterous hand in three steps, maintaining hand-object interaction consistency and avoiding physical infeasibility (shown in the black circle). (b) The annotation system automatically annotates language guidance for hand-object pairs with the help of an LLM.
Figure 4: Quantitative experimental results with different object penetration loss weights $\lambda_{pen}$. Intention is quantified by the Chamfer distance (CD) between predictions and targets. Diversity is assessed by the standard deviation of hand translation $\delta_{t}$. Object penetration is evaluated by the penetration depth (Pen.) from the object point cloud to the hand mesh. Our method uniquely achieves high performance in terms of intention consistency, diversity, and penetration avoidance.
Figure 5: Overview of our framework. (a) With only the regression loss, the Intention and Diversity Grasp Component is trained to reconstruct the original hand pose from noisy poses, conditioned on language and object. (b) With both regression and penetration losses, the Quality Grasp Component is trained to refine the coarse pose, improving grasp quality while maintaining intention consistency.
...and 10 more figures
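
The caption of Figure 4 names two metrics that are easy to illustrate: the Chamfer distance (CD) between predictions and targets, and the standard deviation of hand translation $\delta_{t}$. The following is a minimal sketch of both, not the paper's evaluation code. The function names are hypothetical, and the exact CD variant (squared vs. unsquared distances, sum vs. mean) and the way $\delta_{t}$ is aggregated over axes are assumptions, since this page does not specify them.

```python
import numpy as np


def chamfer_distance(pred_points, target_points):
    """Symmetric Chamfer distance between two point sets of shape (N, 3) and (M, 3).

    Assumption: mean of squared nearest-neighbour distances, summed over
    both directions. Other conventions exist.
    """
    a = np.asarray(pred_points, dtype=float)
    b = np.asarray(target_points, dtype=float)
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Nearest target for each prediction, and nearest prediction for each target.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())


def translation_std(translations):
    """Diversity proxy: standard deviation of sampled hand translations.

    `translations` has shape (K, 3), one row per grasp sampled under the
    same condition. Assumption: per-axis standard deviation, averaged
    over the three axes.
    """
    t = np.asarray(translations, dtype=float)
    return float(t.std(axis=0).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(128, 3))
    pred = target + 0.01 * rng.normal(size=(128, 3))
    print("CD:", chamfer_distance(pred, target))
    print("delta_t:", translation_std(rng.normal(scale=0.05, size=(8, 3))))
```

A lower CD suggests closer agreement with the intended grasp, while a larger translation spread suggests more diverse samples. Penetration depth is omitted here because it requires a hand mesh and a point-to-mesh query.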
