Text2Grasp: Grasp synthesis by text prompts of object grasping parts
Xiaoyun Chang, Yi Sun
TL;DR
Text2Grasp tackles the ambiguity of task- and intention-based grasping by grounding grasp synthesis in text prompts that specify which object parts to grasp. It introduces a two-stage pipeline: a text-guided diffusion model, TextGraspDiff, generates a coarse grasp, and a text-guided hand-object contact optimization then refines it to ensure plausibility, diversity, and alignment with the specified part. The method leverages LLMs to expand prompts into personalized and task-level descriptions without extra annotations, enabling flexible control. Experimental results on OakInk and AffordPose demonstrate competitive grasp quality and precise part-level control, outperforming baselines on penetration and contact while delivering diverse grasps.
Abstract
The hand plays a pivotal role in the human ability to grasp and manipulate objects, and controllable grasp synthesis is key to successfully performing downstream tasks. Existing methods that use human intention or task-level language as control signals for grasping inherently face ambiguity. To address this challenge, we propose Text2Grasp, a grasp synthesis method guided by text prompts of object grasping parts, which provides more precise control. Specifically, we present a two-stage method that first uses a text-guided diffusion model, TextGraspDiff, to generate a coarse grasp pose and then applies a hand-object contact optimization process to ensure both plausibility and diversity. Furthermore, by leveraging a Large Language Model, our method facilitates grasp synthesis guided by task-level and personalized text descriptions without additional manual annotations. Extensive experiments demonstrate that our method achieves not only accurate part-level grasp control but also comparable performance in grasp quality.
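
The coarse-then-refine structure described in the abstract can be illustrated with a toy sketch. This is not the authors' TextGraspDiff model or their contact optimization; it only mirrors the control flow under strong simplifying assumptions: the object is a unit sphere, the "hand" is a single 3D contact point, the text conditioning is a keyword lookup, and the denoiser is a hand-written pull toward the named part. All names (PART_CENTERS, coarse_grasp, refine_contact, text_to_grasp) are hypothetical.

```python
import math
import random

# Assumption: a unit-sphere "object" with two labeled parts on its surface.
PART_CENTERS = {"handle": (0.0, 0.0, 1.0), "body": (1.0, 0.0, 0.0)}


def part_from_prompt(prompt):
    """Toy text conditioning: pick the first part name mentioned in the prompt."""
    text = prompt.lower()
    for name, center in PART_CENTERS.items():
        if name in text:
            return center
    raise ValueError("prompt does not name a known part")


def coarse_grasp(part_center, steps=50, seed=0):
    """Stage 1 stand-in: start from Gaussian noise and denoise iteratively.

    A real diffusion model would use a learned, text-conditioned denoiser;
    here the "denoiser" simply pulls the sample toward the part center.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(3)]
    for t in range(steps, 0, -1):
        noise_scale = 0.05 * t / steps  # injected noise shrinks as t decreases
        x = [xi + 0.2 * (ci - xi) + rng.gauss(0.0, noise_scale)
             for xi, ci in zip(x, part_center)]
    return x


def refine_contact(x, part_center, radius=1.0, w_part=0.1, lr=0.1, iters=200):
    """Stage 2 stand-in: gradient descent on a contact loss.

    loss = (|x| - radius)^2 + w_part * |x - part_center|^2
    The first term keeps the point on the surface (neither penetrating nor
    floating); the second keeps it near the part named in the prompt.
    """
    x = list(x)
    for _ in range(iters):
        norm = max(math.sqrt(sum(v * v for v in x)), 1e-9)
        surface_err = norm - radius
        grad = [2.0 * surface_err * xi / norm + 2.0 * w_part * (xi - ci)
                for xi, ci in zip(x, part_center)]
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


def text_to_grasp(prompt, seed=0):
    """Run both stages: coarse sample, then contact refinement."""
    center = part_from_prompt(prompt)
    return refine_contact(coarse_grasp(center, seed=seed), center)
```

Calling `text_to_grasp("grasp the handle of the mug")` returns a point close to the hypothetical handle location on the sphere surface; changing `seed` changes the coarse sample, which is the toy analogue of sampling diverse grasps.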
