UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy

Yinzhen Xu; Weikang Wan; Jialiang Zhang; Haoran Liu; Zikang Shan; Hao Shen; Ruicheng Wang; Haoran Geng; Yijia Weng; Jiayi Chen; Tengyu Liu; Li Yi; He Wang

TL;DR

UniDexGrasp tackles universal dexterous grasping from table-top depth observations by splitting the problem into dexterous grasp proposal generation and goal-conditioned execution. It introduces GraspIPDF and GraspGlow to decouple rotation from translation/articulation, with ContactNet refining physically plausible grasps, and a teacher-student policy framework to enable vision-based execution across hundreds of object categories. The approach yields over 60% average success in simulation, outperforms baselines, and enables language-guided functional grasps through CLIP-based filtering. The authors provide a large-scale dataset and code, underscoring the method's potential for robust, cross-category dexterous manipulation in realistic settings.

Abstract

In this work, we tackle the problem of learning universal robotic dexterous grasping from a point cloud observation under a table-top setting. The goal is to grasp and lift objects in high-quality and diverse ways and to generalize across hundreds of categories, including unseen ones. Inspired by successful pipelines used in parallel gripper grasping, we split the task into two stages: 1) grasp proposal (pose) generation and 2) goal-conditioned grasp execution. For the first stage, we propose a novel probabilistic model of grasp pose conditioned on the point cloud observation that factorizes rotation from translation and articulation. Trained on our synthesized large-scale dexterous grasp dataset, this model enables us to sample diverse and high-quality dexterous grasp poses for the object point cloud. For the second stage, we propose to replace the motion planning used in parallel gripper grasping with a goal-conditioned grasp policy, due to the complexity involved in dexterous grasping execution. Note that it is very challenging to learn a highly generalizable grasp policy that takes only realistic inputs without oracle states. We thus propose several important innovations, including state canonicalization, object curriculum, and teacher-student distillation. Integrating the two stages, our final pipeline becomes the first to achieve universal generalization for dexterous grasping, demonstrating an average success rate of more than 60% on thousands of object instances, which significantly outperforms all baselines while showing only a minimal generalization gap.
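
One way to read the factorization of rotation from translation and articulation, written in the notation of the Figure 1 caption ($X_0$ for the observed point cloud, $R$ for the sampled rotation, $\tilde{X}_0$ for the canonicalized cloud, $\tilde{\bm{t}}$ and $\bm{q}$ for translation and joint angles), is sketched below. The exact form is an assumption inferred from the abstract and caption, not a formula quoted from the paper:

$$
p(R, \bm{t}, \bm{q} \mid X_0) \;=\; \underbrace{p(R \mid X_0)}_{\text{GraspIPDF}} \;\cdot\; \underbrace{p(\tilde{\bm{t}}, \bm{q} \mid \tilde{X}_0)}_{\text{GraspGlow}}, \qquad \tilde{X}_0 = R^{-1} X_0, \quad \bm{t} = R\,\tilde{\bm{t}}
$$

Under this reading, the second factor only ever sees point clouds in the rotation-canonical frame, and a sampled translation is mapped back to the observation frame by $R$.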

Paper Structure (34 sections, 12 equations, 8 figures, 9 tables)

Introduction
Related Work
Method
Problem Settings and Method Overview
Dexterous Grasp Proposal Generation
GraspIPDF: Grasp Orientation Generation
GraspGlow: Grasp Translation and Articulation Generation given Orientation
End-to-End Training with ContactNet
Test-Time Contact-based Optimization
Goal-Conditioned Dexterous Grasping Policy
Learning Teacher Policy
Distilling to the Vision-based Student Policy
Experiments
Data Generation and Statistics
Results on Grasp Proposal Generation
...and 19 more sections

Figures (8)

Figure 1: Method overview. The left part is the first stage, which generates a dexterous grasp proposal. The input is the object point cloud at time step 0, $X_0$, fused from depth images, with ground truth segmentation of the table and the object. A rotation $R$ is sampled from the distribution implied by the GraspIPDF, and the point cloud will be canonicalized by $R^{-1}$ to $\tilde{X}_0$. The GraspGlow then samples the translation $\tilde{\bm{t}}$ and joint angles $\bm{q}$. Next, the ContactNet takes $\tilde{X}_0$ and a point cloud $\tilde{X}_H$ sampled from the hand to predict the ideal contact map $\bm{c}$ on the object. Then, the predicted hand pose is optimized based on the contact information. The final goal pose is transformed by $R$ to align with the original visual observation. The right part is the second stage, the goal-conditioned dexterous grasping policy that takes the goal $\bm{g}$, point cloud $X_t$ and robot proprioception $\bm{s}^r_t$ to take actions accordingly.
Figure 2: The goal-conditioned dexterous grasping policy pipeline. $\widetilde{{\mathcal{S}}^{\mathcal{E}}_t}=(\widetilde{\bm{s}^r_t},\widetilde{\bm{s}^o_t},X^O,\widetilde{g})$ and $\widetilde{{\mathcal{S}}^{\mathcal{S}}_t}=(\widetilde{\bm{s}^r_t},\widetilde{X_t},\widetilde{g})$ denote the input state of the teacher policy and student policy after state canonicalization, respectively; $\oplus$ denotes concatenation.
Figure 3: Comparison of diversity in grasp translation and articulation given the rotation. Left: 8 outputs of CVAE (completely collapsed to one pose); Middle: 8 outputs of GraspGlow; Right: a ground truth grasp.
Figure 4: Qualitative results of language-guided grasp proposal selection. CLIP can select proposals complying with the language instruction, allowing the goal-conditioned policy to execute potentially functional grasps.
Figure 5: Our object dataset contains more than five thousand objects from various categories. Shown here are visualizations of some decomposed meshes.
...and 3 more figures
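
The canonicalization step described in the Figure 1 caption (rotate the point cloud by $R^{-1}$, predict in that frame, then transform the result back by $R$) can be illustrated with a minimal NumPy sketch. Everything here is a stand-in: `random_rotation` replaces sampling from GraspIPDF, the fixed `t_tilde` and zero `q` replace a GraspGlow sample, and `NUM_JOINTS` is a hypothetical placeholder value. None of the function names come from the authors' code.

```python
import numpy as np

NUM_JOINTS = 22  # hypothetical placeholder for the hand's joint count


def random_rotation(rng):
    """Draw a random 3x3 rotation matrix (stand-in for a GraspIPDF sample)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix the sign ambiguity of the QR factors
    if np.linalg.det(q) < 0:      # ensure a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    return q


def canonicalize(points, R):
    """Apply R^{-1} to each row of an (N, 3) point cloud.

    For a rotation, R^{-1} = R^T, and (R^T p)^T = p^T R for a row vector p.
    """
    return points @ R


def to_observation_frame(t_tilde, R):
    """Map a translation predicted in the canonical frame back by R."""
    return R @ t_tilde


rng = np.random.default_rng(0)
X0 = rng.normal(size=(1024, 3))        # synthetic stand-in for the object point cloud
R = random_rotation(rng)               # stand-in for the sampled grasp rotation
X0_tilde = canonicalize(X0, R)         # cloud in the rotation-canonical frame

t_tilde = np.array([0.0, 0.0, 0.1])    # stand-in for a sampled translation
q = np.zeros(NUM_JOINTS)               # joint angles do not depend on the frame
t = to_observation_frame(t_tilde, R)   # translation in the original frame
```

Because the transform is rigid, the distances between the hand translation and the object points are the same in both frames, which is what lets a model trained only on canonicalized inputs produce goals that remain valid in the original observation.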