Human Grasp Generation for Rigid and Deformable Objects with Decomposed VQ-VAE
Mengshi Qi, Zhe Zhao, Huadong Ma
TL;DR
This paper addresses the challenge of generating realistic human grasps for both rigid and deformable objects. It introduces a part-aware Decomposed VQ-VAE-2 (DVQ-VAE-2) that encodes each hand component with its own discrete codebook and uses dual-stage autoregressive decoding to predict grasp posture separately from grasp position. To model deformation of non-rigid objects, it adds a Mesh UFormer backbone that processes object meshes and a normal vector-guided position encoding that quantifies hand-object interactions. The approach improves grasp quality and diversity across multiple rigid benchmarks and outperforms deformable-object baselines in deformation accuracy, while remaining efficient through autoregressive inference and mesh-based representations. The results suggest strong potential for applications in robotics and embodied AI, and the code and models are released for reproducibility and further development.
Abstract
Generating realistic human grasps is crucial yet challenging for object manipulation in computer graphics and robotics. Current methods often struggle to generate detailed and realistic grasps with full finger-object interaction, as they typically encode the entire hand as a whole and estimate both posture and position in a single step. Simulating object deformation during grasp generation is also difficult, because modeling such deformation requires capturing the relationships among all points on the object's surface. To address these limitations, we propose an improved Decomposed Vector-Quantized Variational Autoencoder (DVQ-VAE-2), which decomposes the hand into distinct parts and encodes them separately. This part-aware architecture allows more precise handling of hand-object interactions. We further introduce a dual-stage decoding strategy that first predicts the grasp type under skeletal constraints and then identifies the optimal grasp position, improving both the realism of the generated grasps and the model's adaptability to unseen interactions. In addition, we introduce a new Mesh UFormer as the backbone network to extract hierarchical structural representations from the mesh, and we propose a new normal vector-guided position encoding to simulate hand-object deformation. In experiments, our model achieves a relative improvement of approximately 14.1% in grasp quality over state-of-the-art methods across four widely used benchmarks. Comparisons with other backbone networks show relative improvements of 2.23% in Hand-object Contact Distance on deformable-object datasets and 5.86% in Quality Index on rigid-object datasets. Our source code and model are available at https://github.com/florasion/D-VQVAE.
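
To make the part-aware encoding concrete, the following is a minimal sketch of per-part vector quantization, in which each hand part's latent vector is snapped to the nearest entry of that part's own codebook. It is an illustration under stated assumptions, not the authors' implementation: the part names in PARTS, the codebook size, the latent dimension, and the helper names nearest_code and quantize_parts are all hypothetical, and training details of a VQ-VAE (gradient estimation, codebook and commitment losses) are omitted.

```python
import numpy as np

# Hypothetical part split; the paper's actual hand decomposition may differ.
PARTS = ["thumb", "index", "middle", "ring", "little", "palm"]


def nearest_code(z, codebook):
    """Return (index, vector) of the codebook row closest to z in L2 distance."""
    dists = np.sum((codebook - z) ** 2, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]


def quantize_parts(latents, codebooks):
    """Quantize each part's latent against that part's own codebook."""
    indices, quantized = {}, {}
    for part, z in latents.items():
        indices[part], quantized[part] = nearest_code(z, codebooks[part])
    return indices, quantized


# Toy setup (assumed sizes): 8 codes of dimension 4 per part.
base = np.arange(32, dtype=float).reshape(8, 4)
codebooks = {part: base.copy() for part in PARTS}

# Each latent is a slightly perturbed copy of code 3, so it should map back to it.
latents = {part: codebooks[part][3] + 0.1 for part in PARTS}

indices, quantized = quantize_parts(latents, codebooks)
print(indices)
```

Keeping one codebook per part means the discrete index chosen for one finger does not constrain the indices chosen for the others, which is the property the part-aware design relies on; a single whole-hand codebook would instead assign one index to the entire pose.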
