ClickDiff: Click to Induce Semantic Contact Map for Controllable Grasp Generation with Diffusion Models

Peiming Li, Ziyi Wang, Mengyuan Liu, Hong Liu, Chen Chen

TL;DR

ClickDiff tackles fine-grained hand-object contact in grasp generation by introducing the Semantic Contact Map (SCM) as a controllable representation, which can be either user-specified or algorithmically predicted. It employs a Dual Generation Framework with a Semantic Conditional Module and a Contact Conditional Module, guided by a Tactile-Guided Constraint (TGC) within a diffusion-model setup, to synthesize realistic grasps. Experiments on GRAB and ARCTIC show improved contact fidelity, higher success, and robustness to unseen objects; code is available at https://github.com/adventurer-w/ClickDiff. The approach advances controllable, physically plausible hand-object interaction for both unimanual and bimanual manipulation.

Abstract

Grasp generation aims to create complex hand-object interactions with a specified object. While traditional approaches for hand generation have primarily focused on visibility and diversity under scene constraints, they tend to overlook fine-grained hand-object interactions such as contacts, resulting in inaccurate and undesired grasps. To address these challenges, we propose a controllable grasp generation task and introduce ClickDiff, a controllable conditional generation model that leverages a fine-grained Semantic Contact Map (SCM). In particular, when synthesizing interactive grasps, the method enables precise control of grasp synthesis through either a user-specified or an algorithmically predicted Semantic Contact Map. Specifically, to make the best use of contact supervision constraints and to accurately model the complex physical structure of hands, we propose a Dual Generation Framework. Within this framework, the Semantic Conditional Module generates reasonable contact maps based on fine-grained contact information, while the Contact Conditional Module uses contact maps alongside object point clouds to generate realistic grasps. We also assess evaluation criteria applicable to controllable grasp generation. Both unimanual and bimanual generation experiments on the GRAB and ARCTIC datasets verify the validity of our proposed method, demonstrating the efficacy and robustness of ClickDiff, even with previously unseen objects. Our code is available at https://github.com/adventurer-w/ClickDiff.

Paper Structure (25 sections, 12 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Previous works suffer from contact ambiguity: a single input object can lead to multiple undesired grasps, for example when a bowl is expected to be grabbed from the bottom. This shows the importance of controllable grasp generation. By manually choosing contact points and specifying the contacting fingers, one can obtain the SCM (each color represents the contact area of a different finger) by traversing the area around each clicked point. Using the SCM, it is then possible to generate an accurate grasp that matches the user's expectation.
  • Figure 2: Overview of ClickDiff. In the Semantic Conditional Module, the model first takes an object's point cloud as input and predicts the contact map conditioned on the Semantic Contact Map. The predicted contact map is then fed into the Contact Conditional Module, where the grasp is generated under the guidance of the Tactile-Guided Constraint (TGC) and the contact map.
  • Figure 3: Illustration of the Semantic Contact Map. The fingers are divided into five parts, represented by different colors. The SCM indicates the points on the object that are being touched and the finger parts touching these points. Each point may be touched by more than one finger.
  • Figure 4: Qualitative comparison on the GRAB dataset (Taheri et al., 2020). While GA and CG produce unnatural distortions and large contact deviations, our method produces more plausible and accurate grasps for unseen objects.
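
The captions for Figures 1 and 3 describe the SCM as a per-point, per-finger labeling that is obtained by traversing the area around each clicked point, with a point allowed to belong to more than one finger. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a Euclidean ball of hypothetical size `radius` as the "area around the clicked point", and the finger names in `FINGERS` as well as the function `build_scm` are illustrative choices that do not come from the paper.

```python
import math

# Assumed names for the five finger parts in Figure 3 (hypothetical labels).
FINGERS = ("thumb", "index", "middle", "ring", "little")


def build_scm(points, clicks, radius=0.02):
    """Build a multi-hot Semantic Contact Map from user clicks.

    points: list of (x, y, z) object points.
    clicks: list of (point_index, finger_name) pairs chosen by the user.
    radius: assumed neighborhood size, in the same units as the points.

    Returns a list with one row per point; row[f] is 1 if the point lies
    within `radius` of a click assigned to finger f, else 0.
    """
    scm = [[0] * len(FINGERS) for _ in points]
    for click_idx, finger in clicks:
        f = FINGERS.index(finger)
        center = points[click_idx]
        # Traverse the object and mark every point close to the click.
        for i, p in enumerate(points):
            if math.dist(p, center) <= radius:
                scm[i][f] = 1
    return scm


# Toy object: two nearby points and two far-away points on a line.
points = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.05, 0.0, 0.0), (0.06, 0.0, 0.0)]
clicks = [(0, "thumb"), (1, "index")]
scm = build_scm(points, clicks, radius=0.015)
```

In this toy case the two click neighborhoods overlap, so the first two points carry both the thumb and the index label while the remaining points stay empty, which mirrors the remark in Figure 3 that a point may be touched by more than one finger. A real object surface would more plausibly use geodesic distance or mesh connectivity rather than a Euclidean ball.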