Human Grasp Generation for Rigid and Deformable Objects with Decomposed VQ-VAE

Mengshi Qi; Zhe Zhao; Huadong Ma

TL;DR

This paper addresses the challenge of generating realistic human grasps for both rigid and deformable objects. It introduces a part-aware Decomposed VQ-VAE-2 (DVQ-VAE-2) that encodes each hand component with its own discrete codebook and uses dual-stage decoding to separate hand posture from hand position. For deformable objects, it adds a Mesh UFormer backbone that processes object meshes and a normal vector-guided position encoding that quantifies hand-object interactions, enabling plausible deformations of non-rigid objects. The approach improves grasp quality and diversity on rigid-object benchmarks, with a relative gain of approximately 14.1% in grasp quality across four widely used benchmarks, and outperforms deformable-object baselines in deformation accuracy, while maintaining efficiency through autoregressive inference and mesh-based representations. The results suggest strong potential for applications in robotics and embodied AI, and the code and models are released for reproducibility and further development.

Abstract

Generating realistic human grasps is crucial yet challenging for object manipulation in computer graphics and robotics. Current methods often struggle to generate detailed and realistic grasps with full finger-object interaction, as they typically encode the entire hand and estimate both posture and position in a single step. In addition, simulating object deformation during grasp generation remains difficult, because modeling such deformation requires capturing the relationships among all points on the object's surface. To address these limitations, we propose an improved Decomposed Vector-Quantized Variational Autoencoder (DVQ-VAE-2), which decomposes the hand into distinct parts and encodes them separately. This part-aware architecture allows for more precise management of hand-object interactions. We also introduce a dual-stage decoding strategy that first predicts the grasp type under skeletal constraints and then identifies the optimal grasp position, enhancing both the realism of the generated grasps and the adaptability of the model to unseen interactions. For deformable objects, we introduce a new Mesh UFormer as the backbone network to extract hierarchical structural representations from the mesh, and we propose a new normal vector-guided position encoding to simulate hand-object deformation. In experiments, our model achieves a relative improvement of approximately 14.1% in grasp quality compared to state-of-the-art methods across four widely used benchmarks. Our comparisons with other backbone networks show relative improvements of 2.23% in Hand-object Contact Distance and 5.86% in Quality Index on deformable-object and rigid-object datasets, respectively. Our source code and model are available at https://github.com/florasion/D-VQVAE.
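The part-aware quantization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the part names, the latent dimension, the codebook size, and the helper names `quantize` and `quantize_hand` are all assumptions made for illustration, and the codebooks here are random rather than learned. The sketch only shows the general vector-quantization step, in which each hand part is snapped to the nearest entry of its own codebook.

```python
import numpy as np

# Hypothetical part names: five fingers and the palm, as listed in the paper.
HAND_PARTS = ["thumb", "index", "middle", "ring", "little", "palm"]


def quantize(z, codebook):
    """Return the nearest codebook entry for each row of z, plus its index."""
    # Squared Euclidean distance between every latent and every code: (N, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx


def quantize_hand(part_latents, codebooks):
    """Quantize each hand part against its own, separate codebook."""
    return {name: quantize(part_latents[name], codebooks[name])
            for name in HAND_PARTS}


rng = np.random.default_rng(0)
dim, num_codes = 8, 16  # illustrative sizes, not taken from the paper

# One random codebook per part; a trained model would learn these.
codebooks = {p: rng.normal(size=(num_codes, dim)) for p in HAND_PARTS}
# One continuous latent vector per part, standing in for encoder outputs.
part_latents = {p: rng.normal(size=(1, dim)) for p in HAND_PARTS}

result = quantize_hand(part_latents, codebooks)
# A grasp is then summarized by one discrete index per hand part.
indices = {p: int(result[p][1][0]) for p in HAND_PARTS}
```

Keeping one codebook per part means each finger's discrete code can vary independently of the others, which is one plausible reading of why the decomposition helps with fine finger-object contact.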

Paper Structure (22 sections, 39 equations, 10 figures, 4 tables)

Introduction
Related Work
Grasp Generation for Rigid Objects
Overview
Object Encoder
Part-Aware Decomposed Architecture
Dual-Stage Decoding Strategy
Optimization
Grasp Generation for Deformable Objects
Mesh UFormer
Hand-Object Deformation Quantification
Optimization
Experiments
Datasets
Metrics
...and 7 more sections

Figures (10)

Figure 1: Illustration of our proposed grasp generation model. First, we employ Decomposed VQ-VAE-2 (DVQ-VAE-2) to learn the prior distributions of the object and each hand component (i.e., five fingers and the palm) during training. The decoding process is divided into two stages: generating hand posture and generating hand position. During inference, we use autoregression guided by the object to generate realistic human grasps. For deformable objects, we propose a new Mesh UFormer to simulate deformations.
Figure 2: Overall architecture of the proposed DVQ-VAE model, which follows the encoder-decoder paradigm. During training, the model takes hand vertices and object point clouds as inputs, and maps them to discrete latent spaces comprising seven codebooks (i.e., one for the object and six for different hand components) to generate hands capable of grasping objects. During inference, only object point clouds are used as input to generate hands grasping the given object.
Figure 3: The upper part of the figure illustrates the overall architecture of our DVQ-VAE-2 with Mesh UFormer used for deformation simulation. We employ our newly proposed hand-object contact quantification method to construct the input for each point on the object, combined with voxel-based down-sampling to create a hierarchical input. Our Mesh UFormer is then used to simulate the deformations. The lower part of the figure details the structure of the encoder and decoder in the Mesh UFormer. These modules use a simplified PointTransformer integrated with our proposed normal vector-guided position encoding, combined with Scatter Add and Gather operations to achieve feature forward mapping and backward mapping between layers.
Figure 4: Comparison between our proposed Hand-Object Contact Quantification method and traditional methods. Unlike simply using the hand-to-object distance or the object-to-hand distance, we utilize both distances simultaneously.
Figure 5: Comparison of our proposed method with other models on the HO-3D dataset, showing the high-quality ratio as a function of the penetration threshold.
...and 5 more figures
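
Two operations named in the captions of Figures 3 and 4 can be sketched in NumPy. This is a rough illustration under stated assumptions, not the paper's code: the function names, the voxel size, and the toy coordinates are hypothetical, and the captions do not say how the two distances are combined into a per-point input, so the sketch simply computes both. The first function computes nearest-neighbor distances in either direction (object-to-hand and hand-to-object). The other two show voxel-based down-sampling as a scatter-add followed by averaging, and the reverse mapping as a gather.

```python
import numpy as np


def nearest_distances(src, dst):
    """For each point in src, the distance to its nearest point in dst."""
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
    return d.min(axis=1)


def voxel_pool(points, feats, voxel_size):
    """Average the features of points that share a voxel (scatter-add)."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    # inverse[i] is the voxel id assigned to point i.
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    num_voxels = int(inverse.max()) + 1
    summed = np.zeros((num_voxels, feats.shape[1]))
    np.add.at(summed, inverse, feats)  # scatter-add into voxel slots
    counts = np.bincount(inverse, minlength=num_voxels)[:, None]
    return summed / counts, inverse


def voxel_unpool(voxel_feats, inverse):
    """Gather: copy each voxel's feature back to its member points."""
    return voxel_feats[inverse]


# Toy coordinates chosen for illustration only.
obj = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
hand = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 2.0]])

obj_to_hand = nearest_distances(obj, hand)  # one value per object point
hand_to_obj = nearest_distances(hand, obj)  # one value per hand point

feats = np.array([[1.0], [3.0], [5.0]])
pooled, inverse = voxel_pool(obj, feats, voxel_size=2.0)
unpooled = voxel_unpool(pooled, inverse)
```

With a voxel size of 2.0, the first two object points fall into the same voxel and their features are averaged, while the third point keeps its own voxel. The gather step then restores a per-point feature array of the original length.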
