RegionGrasp: A Novel Task for Contact Region Controllable Hand Grasp Generation

Yilin Wang, Chuan Guo, Li Cheng, Hai Jiang

TL;DR

RegionGrasp tackles the problem of region-controllable hand grasp generation by proposing RegionGrasp-CVAE, a conditional variational autoencoder equipped with ConditionNet for region-aware object encoding and HOINet for interaction-aware hand-object coupling. The approach uses point-patch representations and a pretraining strategy to capture geometry, enabling low-level spatial control over the contact region and robust hand-object interactions. Across ObMan and GRAB datasets, RegionGrasp-CVAE achieves competitive region controllability (CR) and contact quality (CCA/IV), while delivering diverse grasps and good generalization to out-of-domain objects; user studies corroborate improved controllability without sacrificing naturalness. This work advances practical region-specific grasp synthesis for applications like VR, and suggests future integration with physics priors and language-informed priors to further enhance plausibility and control.

Abstract

Can a machine automatically generate multiple distinct and natural hand grasps, given a specific contact region of an object in 3D? This motivates us to consider a novel task, Region Controllable Hand Grasp Generation (RegionGrasp): given a 3D object as input, together with a specific surface area selected as the intended contact region, generate a diverse set of plausible hand grasps of the object in which the thumb fingertip touches the object surface within the contact region. To address this task, we propose RegionGrasp-CVAE, which consists of two main parts. First, to enable contact-region awareness, we propose ConditionNet as the condition encoder, which includes a transformer-based object encoder, O-Enc. O-Enc is pretrained with a strategy in which point patches of the object surface are randomly masked and subsequently restored, to better capture the object's surface geometry. Second, to realize interaction awareness, HOINet is introduced to encode hand-object interaction features by entangling high-level hand features with embedded object features through geometric-aware multi-head cross attention. Empirical evaluations demonstrate the effectiveness of our approach both qualitatively and quantitatively, and show that it compares favorably with state-of-the-art methods.
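
The masking step of the O-Enc pretraining strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_masked_patches`, the patch count of 64, and the mask ratio of 0.6 are all hypothetical values chosen for illustration. The sketch only selects which point patches would be hidden; the encoder and the reconstruction of the hidden patches are omitted.

```python
import random


def split_masked_patches(num_patches, mask_ratio, seed=None):
    """Randomly choose which point patches to hide for mask auto-encoding.

    Returns (visible_idx, masked_idx), two sorted lists of patch indices.
    The mask ratio is an assumption; the source does not state a value.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    # Sample patch indices without replacement; these are the hidden patches.
    masked = sorted(rng.sample(range(num_patches), num_masked))
    masked_set = set(masked)
    # Every remaining patch stays visible to the encoder.
    visible = [i for i in range(num_patches) if i not in masked_set]
    return visible, masked


# Hypothetical setting: 64 patches, 60% of them hidden.
visible_idx, masked_idx = split_masked_patches(64, 0.6, seed=0)
```

In a full pipeline of this kind, only the visible patches would be passed to the encoder, and a reconstruction loss would compare the restored patches against the original points at `masked_idx`.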

Paper Structure

This paper contains 22 sections, 2 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Illustration of the region controllable hand grasp generation task. Given an object with a specific condition region, the task requires automatic generation of diverse and natural hand grasps with the thumb fingertip in contact with the condition region. The bottom two rows show hand grasps generated by our proposed method, RegionGrasp-CVAE, which exhibit great diversity and controllability on both in-domain and out-of-domain objects.
  • Figure 2: An overview of our RegionGrasp-CVAE framework with its training and inference pipelines. The input object is resampled to a point cloud and grouped into point patches. The binary condition region mask is generated from the point patches to distinguish the condition region from other object regions. The ConditionNet embeds geometric-aware object tokens through O-Enc; these tokens are then partially masked by the condition region mask and finally encoded as a global region-aware feature vector $z_c$ by the Condition Region Encoder. During training, hand tokens embedded by H-Enc interact with object tokens from O-Enc through the HOI encoder to encode hand-object interaction (HOI) features $f_I$, which yield the posterior distribution $Q(z|P_o, p_c, R, V_h, e_h)$. A latent vector is sampled from the posterior distribution, concatenated with $z_c$, and then mapped back to the hand mesh space through the VAE decoder and the MANO model. During inference, the latent code $z$ is randomly sampled from a standard Gaussian distribution. The VAE decoder takes the concatenation of $z_c$ and the sampled latent code $z$ and generates MANO hand parameters, which are then regressed to the output hand shape.
  • Figure 3: An overview of our RegionGrasp-CVAE framework. (a) Pretraining pipeline for O-Enc based on masked auto-encoding. (b) Detailed architecture of the GA-MHSA and GA-MHCA blocks in the HOI Encoder, the core component of our HOINet.
  • Figure 4: Qualitative comparison with GraspTTA. The best two grasps are selected for each method, tested on in-domain / out-of-domain objects.
  • Figure 5: Generated hand grasps given different objects/condition regions from ObMan (in domain) and GRAB (out of domain). See supp. for more 3D demos.
  • ...and 2 more figures
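
The difference between training-time and inference-time latent sampling described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names, the use of a log-variance parameterization for the posterior, and the toy dimensions are assumptions. The HOI encoder that would produce the posterior parameters and the VAE decoder with the MANO model are left out.

```python
import math
import random


def sample_posterior(mu, log_var, rng):
    """Training: reparameterized sample z = mu + sigma * eps, eps ~ N(0, 1).

    Assumes the posterior is a diagonal Gaussian given by mean and log-variance.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def sample_prior(dim, rng):
    """Inference: draw z from a standard Gaussian; no hand input is needed."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def decoder_input(z, z_c):
    """Concatenate the latent code z with the region-aware condition vector z_c."""
    return list(z) + list(z_c)


rng = random.Random(0)
z_c = [0.5, -0.2, 0.1]  # toy stand-in for the condition feature vector

# Training path: posterior parameters would come from the HOI features.
z_train = sample_posterior([1.0, 2.0, 0.0, -1.0], [0.0, 0.0, 0.0, 0.0], rng)
x_train = decoder_input(z_train, z_c)

# Inference path: the latent code is drawn from the prior only.
z_infer = sample_prior(4, rng)
x_infer = decoder_input(z_infer, z_c)
```

Because only `z` changes between draws while `z_c` stays fixed, repeated calls to `sample_prior` give different decoder inputs for the same condition region, which is how a CVAE of this kind produces diverse grasps for one region.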