Dexterous Grasp Transformer

Guo-Hao Xu; Yi-Lin Wei; Dian Zheng; Xiao-Ming Wu; Wei-Shi Zheng

TL;DR

Dexterous Grasp Transformer (DGTR) reframes dexterous grasp generation as set prediction and uses a transformer decoder with learnable grasp queries to predict a diverse set of high-quality grasps in one forward pass. To overcome optimization challenges inherent to set-based learning and penetration penalties, it introduces Dynamic-Static Matching Training (DSMT) and Adversarial-Balanced Test-Time Adaptation (AB-TTA), achieving improved stability, diversity, and feasibility on DexGraspNet. Quantitative results show DGTR outperforms state-of-the-art one-shot methods in grasp quality and diversity while maintaining efficiency, with ablations confirming the effectiveness of DSMT and AB-TTA. The work lays a foundation for rapid, robust dexterous grasp generation in real-world robotic manipulation, reducing computation and data preprocessing needs while expanding directional grasp diversity.

In the grasp parametrization, the hand's global rotation $R$ and translation $t$ lie in $SO(3)$ and $\mathbb{R}^{3}$, respectively, and joint configurations are represented in $\mathbb{R}^{J}$, with $J=22$ for ShadowHand.
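As a minimal illustration of this parametrization, the sketch below stores one grasp as a rotation matrix, a translation vector, and a joint vector, and checks that each part lies in the stated space. The names `GraspPose`, `is_rotation`, and `rot_z` are hypothetical and are not taken from the DGTR code base; representing $R$ as a 3x3 matrix is also an assumption, since the text only states that $R$ lies in $SO(3)$.

```python
import math
from dataclasses import dataclass
from typing import List

NUM_JOINTS = 22  # J = 22 for ShadowHand, as stated in the text


def _det3(m: List[List[float]]) -> float:
    """Determinant of a 3x3 matrix via cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def is_rotation(m: List[List[float]], tol: float = 1e-6) -> bool:
    """True if m is 3x3, orthonormal (m^T m = I), and has determinant +1."""
    if len(m) != 3 or any(len(row) != 3 for row in m):
        return False
    for i in range(3):
        for j in range(3):
            dot = sum(m[k][i] * m[k][j] for k in range(3))
            if abs(dot - (1.0 if i == j else 0.0)) > tol:
                return False
    return abs(_det3(m) - 1.0) <= tol


def rot_z(angle: float) -> List[List[float]]:
    """Rotation by `angle` radians about the z axis (used only as test data)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


@dataclass
class GraspPose:
    """Hypothetical container: R in SO(3), t in R^3, joints in R^J."""
    rotation: List[List[float]]
    translation: List[float]
    joints: List[float]

    def is_valid(self) -> bool:
        return (is_rotation(self.rotation)
                and len(self.translation) == 3
                and len(self.joints) == NUM_JOINTS)


pose = GraspPose(rot_z(0.3), [0.0, 0.0, 0.1], [0.0] * NUM_JOINTS)
print(pose.is_valid())
```

A reflection (determinant -1) or a joint vector of the wrong length fails the check, which is the practical difference between a general 3x3 matrix and an element of $SO(3)$.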

Abstract

In this work, we propose a novel discriminative framework for dexterous grasp generation, named Dexterous Grasp TRansformer (DGTR), capable of predicting a diverse set of feasible grasp poses by processing the object point cloud with only one forward pass. We formulate dexterous grasp generation as a set prediction task and design a transformer-based grasping model for it. However, we identify that this set prediction paradigm encounters several optimization challenges in the field of dexterous grasping and results in restricted performance. To address these issues, we propose progressive strategies for both the training and testing phases. First, the dynamic-static matching training (DSMT) strategy is presented to enhance the optimization stability during the training phase. Second, we introduce the adversarial-balanced test-time adaptation (AB-TTA) with a pair of adversarial losses to improve grasping quality during the testing phase. Experimental results on the DexGraspNet dataset demonstrate the capability of DGTR to predict dexterous grasp poses with both high quality and diversity. Notably, while keeping high quality, the diversity of grasp poses predicted by DGTR significantly outperforms previous works in multiple metrics without any data pre-processing. Code is available at https://github.com/iSEE-Laboratory/DGTR.
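The set prediction formulation and the two matching stages of DSMT can be illustrated with a small sketch. According to the Figure 3 caption, the dynamic stage recomputes a Hungarian matching between predicted and ground-truth grasps, and the static stage reuses the matching recorded in the dynamic stage. The code below is not the authors' implementation: it replaces the Hungarian algorithm with a brute-force search over permutations (same optimum, only practical for tiny sets), uses a squared L2 distance as a hypothetical matching cost, and does not model the penetration loss. The names `pose_cost`, `match_set`, and `MatchingSchedule` are illustrative.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

Pose = Sequence[float]


def pose_cost(pred: Pose, target: Pose) -> float:
    """Hypothetical matching cost: squared L2 distance between pose vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def match_set(preds: List[Pose], targets: List[Pose]) -> List[Tuple[int, int]]:
    """Minimum-cost one-to-one assignment of targets to predictions.

    Brute-force stand-in for the Hungarian algorithm; returns a list of
    (prediction_index, target_index) pairs, one per target.
    """
    n, m = len(preds), len(targets)
    if m > n:
        raise ValueError("need at least as many predictions as targets")
    best_cost, best = float("inf"), ()
    for perm in permutations(range(n), m):
        cost = sum(pose_cost(preds[q], targets[g]) for g, q in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return [(q, g) for g, q in enumerate(best)]


class MatchingSchedule:
    """Dynamic stage: rematch and record. Static stage: reuse the record."""

    def __init__(self) -> None:
        self.recorded: Dict[str, List[Tuple[int, int]]] = {}

    def get(self, object_id: str, preds: List[Pose],
            targets: List[Pose], stage: str) -> List[Tuple[int, int]]:
        if stage == "dynamic":
            pairs = match_set(preds, targets)
            self.recorded[object_id] = pairs  # remembered for the static stage
            return pairs
        if stage == "static":
            return self.recorded[object_id]   # fixed, ignores current preds
        raise ValueError(f"unknown stage: {stage}")


schedule = MatchingSchedule()
targets = [[9.0], [1.0]]
print(schedule.get("mug", [[0.0], [5.0], [10.0]], targets, "dynamic"))
print(schedule.get("mug", [[10.0], [0.0], [5.0]], targets, "static"))
```

In the static call the predictions have changed, but the assignment has not. Freezing the targets each query is trained toward is what allows an additional loss term to be introduced without the matching itself shifting from step to step. For realistic set sizes a polynomial-time solver such as SciPy's `linear_sum_assignment` would replace the brute-force search.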

Paper Structure (23 sections, 7 equations, 8 figures, 5 tables, 1 algorithm)

Introduction
Related Works
Dexterous Grasp Generation
Vision Transformer
Dexterous Grasp Transformer
Problem Formulation
DGTR Architecture
Dynamic-Static Matching Training Strategy
Adversarial-Balanced Test-Time Adaptation
Grasp Losses
Experiments
Dataset and Evaluation Metrics
Implementation Details
Dexterous Grasp Generation Performance
Comparison with SOTA in one forward pass
...and 8 more sections

Figures (8)

Figure 1: Comparison of DGTR and other dexterous grasping frameworks. The generative models (a) usually learn the distribution of the grasp poses conditioned on the object point cloud. At test time, they mainly infer multiple times to generate several grasps but produce nearly identical grasp poses with the same condition. The vanilla discriminative models (b) mainly learn to predict one grasp pose for the input point cloud. Our DGTR model (c) adopts a transformer decoder and learnable queries, and learns to predict a set of diverse grasp poses with one forward pass.
Figure 2: Comparison of grasp quality and diversity under different penetration loss weights. We visualize 3 grasps for each circumstance. (a) large object penetration weight; (b) zero object penetration weight; (c) our progressive strategies.
Figure 3: Overview of our DGTR framework. The input of DGTR is the complete point cloud $\mathcal{O}$ of an object. First, the PointNet++ encoder downsamples the point cloud and extracts a set of object features. Next, the transformer decoder takes $N$ learnable query embeddings as well as the object features as input and predicts $N$ diverse grasp poses in parallel. In the dynamic matching training (DMT) stage, our model is trained with the matching result produced by the Hungarian algorithm and without object penetration loss. In the static matching training stage, we use the static matching recorded in the DMT stage to train the model with object penetration loss. At test time, we adopt an adversarial-balanced loss to directly finetune the hand pose parameters.
Figure 4: Comparative analysis of grasp pose similarity and object penetration with various penetration loss weights. Similarity is measured by the cosine similarity of $N$ predicted grasp poses, which represents the non-diversity. Penetration is the object penetration from the object point cloud to the hand mesh. Ours denotes the model trained with our proposed DSMT strategy.
Figure 5: Hungarian matching instability during training under different penetration loss weights. The instability is measured by the IS metric introduced in DN-DETR, where a higher value indicates greater instability.
...and 3 more figures
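The two diagnostic quantities named in the captions of Figures 4 and 5 can be sketched as follows. Both functions rest on assumptions. The captions say similarity is the cosine similarity of the $N$ predicted grasp poses, but not how pairs are aggregated; the sketch takes the mean over all unordered pairs of flattened pose vectors. For instability, the sketch assumes the IS metric counts the queries whose matched ground-truth index differs between two consecutive epochs, with `None` marking an unmatched query. The function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(poses: List[Sequence[float]]) -> float:
    """Assumed aggregation: mean cosine similarity over all unordered pairs.

    Higher values mean the predicted grasps are more alike (less diverse).
    """
    if len(poses) < 2:
        raise ValueError("need at least two poses")
    pairs = list(combinations(poses, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def matching_instability(prev: List[Optional[int]],
                         curr: List[Optional[int]]) -> int:
    """Assumed IS definition: number of queries whose match changed.

    prev[i] and curr[i] are the ground-truth indices matched to query i in
    two consecutive epochs (None if the query was unmatched).
    """
    if len(prev) != len(curr):
        raise ValueError("both epochs must cover the same queries")
    return sum(1 for p, c in zip(prev, curr) if p != c)


identical = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
print(mean_pairwise_similarity(identical))
print(matching_instability([0, None, 1], [0, 1, None]))
```

Identical poses give the maximum similarity, which corresponds to the collapsed, non-diverse predictions the captions describe, while a matching that reshuffles between epochs raises the instability count.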
