DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping

Yuliang Wu, Yanhan Lin, WengKit Lao, Yuhao Lin, Yi-Lin Wei, Wei-Shi Zheng, Ancong Wu

Abstract

To meet the demands of increasingly diverse dexterous hand hardware, it is crucial to develop a policy that enables zero-shot cross-embodiment grasping without redundant re-learning. Cross-embodiment alignment is challenging due to heterogeneous hand kinematics and physical constraints. Existing approaches typically predict intermediate motion targets and retarget them to each embodiment, which may introduce errors and violate embodiment-specific limits, hindering transfer across diverse hands. To overcome these limitations, we propose DexGrasp-Zero, a policy that learns universal grasping skills from diverse embodiments, enabling zero-shot transfer to unseen hands. We first introduce a morphology-aligned graph representation that maps each hand's kinematic keypoints to anatomically grounded nodes and equips each node with tri-axial orthogonal motion primitives, enabling structural and semantic alignment across different morphologies. Relying on this graph-based representation, we design a Morphology-Aligned Graph Convolutional Network (MAGCN) to encode the graph for policy learning. MAGCN incorporates a Physical Property Injection mechanism that fuses hand-specific physical constraints into the graph features, enabling adaptive compensation for varying link lengths and actuation limits for precise and stable grasping. Our extensive simulation evaluations on the YCB dataset demonstrate that our policy, jointly trained on four heterogeneous hands (Allegro, Shadow, Schunk, Ability), achieves an 85% zero-shot success rate on unseen hardware (LEAP, Inspire), outperforming the state-of-the-art method by 59.5%. Real-world experiments further evaluate our policy on three robot platforms (LEAP, Inspire, Revo2), achieving an 82% average success rate on unseen objects.

Paper Structure (60 sections, 25 equations, 12 figures, 7 tables)


Figures (12)

  • Figure 1: Paradigm comparison: prior approaches versus our method. (a) Prior paradigm: Existing methods [cross24eccv] train on a simplified and lossy unified state space. They output intermediate motion targets that require hand-specific retargeting models to convert into physical joint commands. This adds complexity and can lead to kinematically infeasible actions. (b) Our paradigm: We learn a single universal policy end-to-end. The policy operates on a lossless morphology-aligned graph representation and outputs actions in a hand-agnostic motion-primitive space. Physical commands are generated directly through a fixed hand-specific mapping $\mathcal{M}_h$, removing the need for trainable retargeting modules. (c) Real-world deployment on unseen hands validates the effectiveness and zero-shot transfer capability of our approach.
  • Figure 2: Universal hand representation. (a) Morphology-Aligned State Graph Representation: nodes correspond to anatomical units, edges follow kinematic chains, yielding a hand-agnostic semantic graph structure. (b) Schematic of three motion primitives (Flexion, Abduction, Axial Rotation) on a Schunk hand, showing their physical motion effects at representative joints.
  • Figure 3: Architecture of DexGrasp-Zero. At each time step $t$: (a) Morphology-Aligned Graph Encoder encodes hand-object state into node features $\mathbf{X}^{h}_{\text{node},t}$ and global feature $\mathbf{x}_{g,t}^{h}$ using a hand-specific graph (adjacency $\mathbf{A}^h$); a GCN with per-layer physical priors produces embeddings $\mathbf{E}_{\text{node},t}^{h}$ and $\mathbf{E}_{g,t}^{h}$. (b) Physical Property Encoder parses hand URDF to build a physical graph $\mathcal{G}_{\text{physical}}^{h}$ (joint limits, link lengths, etc.) and an activation mask $\mathbf{M}_{\text{activation}}^{h}$, encoded into $\mathbf{E}_{\text{p}}^{h}$ and fused into every GCN layer. (c) Decoder outputs motion primitives $\boldsymbol{\alpha}_{\text{prim}}^{h}$: wrist 6-DoF commands from $\mathbf{E}_{g,t}^{h}$ and wrist features, and joint actions from masked node embeddings; the latter are mapped via hand-specific $\mathcal{M}_h$ to executable joint commands $\alpha_{\text{physical},t}^{h}$.
  • Figure 4: Hardware setup. We evaluate our method on three robot platforms: (a) Kinova arm with LEAP hand, (b) Kinova arm with Inspire hand, and (c) Piper arm with Revo2 hand.
  • Figure 5: Simulated grasps of training hands on 5 diverse objects.
  • ...and 7 more figures
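
The captions for Figures 1 to 3 describe a policy that emits per-node motion primitives (flexion, abduction, axial rotation), an activation mask marking which primitives a given hand can actuate, and a fixed hand-specific mapping $\mathcal{M}_h$ that turns primitives into joint commands. The sketch below is one possible reading of that pipeline, not the authors' implementation. The node names, joint names, joint limits, the `Joint` class, and the affine scaling from [-1, 1] to joint limits are all hypothetical assumptions chosen for illustration.

```python
from dataclasses import dataclass

# The three motion primitives named in the Figure 2 caption.
PRIMITIVES = ("flexion", "abduction", "axial_rotation")


@dataclass(frozen=True)
class Joint:
    """A physical joint with its limits in radians (hypothetical container)."""
    name: str
    lower: float
    upper: float


# Hypothetical fixed mapping M_h for a toy hand with two anatomical nodes.
# Keys are (node, primitive); values are the joints that realise them.
TOY_HAND = {
    ("index_proximal", "flexion"): Joint("index_mcp_flex", 0.0, 1.6),
    ("index_proximal", "abduction"): Joint("index_mcp_abd", -0.4, 0.4),
    ("index_middle", "flexion"): Joint("index_pip_flex", 0.0, 1.8),
}


def activation_mask(mapping, nodes):
    """Return, per node, which of the three primitives this hand actuates."""
    return {n: tuple((n, p) in mapping for p in PRIMITIVES) for n in nodes}


def to_joint_commands(primitive_actions, mapping):
    """Map primitive actions in [-1, 1] to joint targets within joint limits.

    Assumption: an affine rescale from [-1, 1] onto [lower, upper].
    Primitives without a physical joint on this hand are ignored,
    which plays the role of the activation mask.
    """
    commands = {}
    for (node, prim), joint in mapping.items():
        a = primitive_actions[node][PRIMITIVES.index(prim)]
        a = max(-1.0, min(1.0, a))  # clip to the normalised action range
        span = joint.upper - joint.lower
        commands[joint.name] = joint.lower + 0.5 * (a + 1.0) * span
    return commands


nodes = ("index_proximal", "index_middle")
mask = activation_mask(TOY_HAND, nodes)
actions = {
    "index_proximal": (1.0, 0.0, 0.7),  # rotation has no joint here
    "index_middle": (-1.0, 0.5, 0.0),   # abduction has no joint here
}
commands = to_joint_commands(actions, TOY_HAND)
```

Because the mapping is a lookup table rather than a learned module, every command stays inside the listed joint limits by construction, which is the property the Figure 1 caption contrasts with trainable retargeting.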