Text2Grasp: Grasp synthesis by text prompts of object grasping parts

Xiaoyun Chang, Yi Sun

TL;DR

Text2Grasp tackles the ambiguity of task- and intention-based grasping by grounding grasp synthesis in text prompts that specify object parts to grasp. It introduces a two-stage pipeline: a text-guided diffusion model TextGraspDiff generates a coarse grasp, followed by a text-guided hand-object contact optimization to ensure plausibility, diversity, and alignment with the specified part. The method leverages LLMs to expand prompts into personalized and task-level descriptions without extra annotations, enabling flexible control. Experimental results on OakInk and AffordPose demonstrate competitive grasp quality and precise part-level control, outperforming baselines on penetration and contact while delivering diverse grasps.

Abstract

The hand plays a pivotal role in the human ability to grasp and manipulate objects, and controllable grasp synthesis is key to successfully performing downstream tasks. Existing methods that use human intention or task-level language as control signals for grasping inherently face ambiguity. To address this challenge, we propose Text2Grasp, a grasp synthesis method guided by text prompts of object grasping parts, which provides more precise control. Specifically, we present a two-stage method: a text-guided diffusion model, TextGraspDiff, first generates a coarse grasp pose, and a hand-object contact optimization process then ensures both plausibility and diversity. Furthermore, by leveraging a Large Language Model, our method supports grasp synthesis guided by task-level and personalized text descriptions without additional manual annotations. Extensive experiments demonstrate that our method achieves not only accurate part-level grasp control but also comparable performance in grasp quality.
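To make the two-stage structure concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: `coarse_grasp`, `contact_loss`, and `refine` are hypothetical names, the first stage is a random placeholder rather than a diffusion model, and the second stage only pulls fingertip points toward the text-specified part, omitting the penetration and finger-specific terms a real contact optimization would need.

```python
import math
import random


def coarse_grasp(part_points, n_fingers=5, noise=0.05, seed=0):
    """Stage 1 placeholder: noisy fingertip guesses around the part centroid.

    Assumption: this stands in for the text-guided diffusion model,
    which it does not attempt to reproduce.
    """
    rng = random.Random(seed)
    n = len(part_points)
    centroid = [sum(p[i] for p in part_points) / n for i in range(3)]
    return [
        tuple(centroid[i] + rng.gauss(0.0, noise) for i in range(3))
        for _ in range(n_fingers)
    ]


def nearest(point, cloud):
    """Return the point in `cloud` closest to `point`."""
    return min(cloud, key=lambda q: math.dist(point, q))


def contact_loss(tips, part_points):
    """Mean squared distance from each fingertip to the target part."""
    return sum(math.dist(t, nearest(t, part_points)) ** 2 for t in tips) / len(tips)


def refine(tips, part_points, steps=50, lr=0.2):
    """Stage 2 placeholder: gradient steps on the squared fingertip-to-part distance."""
    for _ in range(steps):
        updated = []
        for t in tips:
            q = nearest(t, part_points)
            # Gradient of ||t - q||^2 with respect to t is 2 * (t - q).
            updated.append(tuple(t[i] - lr * 2.0 * (t[i] - q[i]) for i in range(3)))
        tips = updated
    return tips


# Hypothetical "handle" part: points along a short vertical segment.
handle = [(0.1, 0.0, 0.01 * k) for k in range(11)]
tips = coarse_grasp(handle)
refined = refine(tips, handle)
print(contact_loss(tips, handle), contact_loss(refined, handle))
```

In this sketch the text prompt is assumed to have already been resolved to a set of part points (`handle`); how the prompt selects the part is outside what the toy covers.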

Paper Structure

This paper contains 14 sections, 8 equations, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: Given an object, Text2Grasp can generate specific hand grasps by interpreting various text inputs: a) Template text. b) Personalized text. c) Task-level text.
  • Figure 2: Overview of Text2Grasp. We present a semi-automatic approach to generate both the template and the personalized text prompts for each grasp in the datasets, which are used to train TextGraspDiff. Given the point cloud of an object and a text description of the object grasping parts, we introduce a two-stage method: a text-guided diffusion model, TextGraspDiff, first generates a coarse grasp pose, and a hand-object contact optimization process then ensures both plausibility and diversity. The final hand mesh is obtained with the MANO model [romero2017embodied].
  • Figure 3: The contact optimization. The contact optimization consists of finger perception and object part perception. The finger perception optimization directs the particular fingers used for grasping toward the object, and the object part optimization guides those fingers toward the object part specified by the text.
  • Figure 4: Qualitative results on the OakInk [yang2022oakink] and AffordPose [jian2023affordpose] datasets. The results above the dotted line are from the OakInk [yang2022oakink] dataset, while those below are from the AffordPose [jian2023affordpose] dataset.
  • Figure 5: Qualitative results of diverse grasps on the objects. For each object, we visualize five grasps; red shapes represent abnormal grasps.
  • ...and 6 more figures