Towards Affordance-Aware Robotic Dexterous Grasping with Human-like Priors

Haoyu Zhao, Linghao Zhuang, Xingyue Zhao, Cheng Zeng, Haoran Xu, Yuming Jiang, Jun Cen, Kexiang Wang, Jiayan Guo, Siteng Huang, Xin Li, Deli Zhao, Hua Zou

TL;DR

This work tackles generalizable, safe, and human-like dexterous grasping by introducing AffordDex, a two-stage framework that combines strong human motion priors with affordance-aware refinement. The first stage pre-trains a human-motion policy $\pi^H$ via imitation. The second stage learns a residual via PPO, guided by the Negative Affordance-aware Segmentation (NAA) $N_t$, and distills a vision-based policy $\pi^S$ through DAgger using privileged state information $S^T_t$. The NAA module leverages Vision-Language Models to identify non-touchable regions, producing a negative affordance map that enables functionally correct grasps that remain human-like. Evaluations on UniDexGrasp and OakInk2 show state-of-the-art performance across seen, unseen, and novel categories, with improved naturalness and safer contact locations, indicating strong generalization and practical potential for downstream manipulation and sim-to-real transfer.
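
The DAgger-style distillation mentioned above can be illustrated with a deliberately tiny example. This is a minimal sketch under stated assumptions, not the AffordDex training code: the 1-D dynamics, the linear student, and the names `teacher_action`, `observe`, `rollout`, and `dagger` are all hypothetical, and the real student $\pi^S$ is vision-based rather than linear. What the sketch does show is the core loop: roll out the student, label the states it visits with the teacher's actions, aggregate the data, and refit.

```python
import numpy as np

def teacher_action(priv_state):
    # Hypothetical teacher with access to the true (privileged) state:
    # it pushes the state halfway toward zero.
    return -0.5 * priv_state

def observe(priv_state):
    # Hypothetical student observation: the state plus a constant bias feature.
    return np.array([priv_state, 1.0])

def rollout(weights, steps, rng):
    # Roll out the *student* policy and record the states it actually visits.
    state = rng.uniform(-1.0, 1.0)
    visited = []
    for _ in range(steps):
        visited.append(state)
        action = float(observe(state) @ weights)
        state = state + action  # toy dynamics (assumption)
    return visited

def dagger(iterations=5, steps=20, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.zeros(2)      # linear student, initially outputs zero actions
    obs_buf, act_buf = [], []  # aggregated dataset, grows every iteration
    for _ in range(iterations):
        for state in rollout(weights, steps, rng):
            obs_buf.append(observe(state))
            act_buf.append(teacher_action(state))  # teacher relabels student states
        X = np.stack(obs_buf)
        y = np.array(act_buf)
        # Refit the student on everything collected so far.
        weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

student_weights = dagger()
```

Because the toy teacher is itself linear in the observed feature, the fitted student converges to the teacher's gain; in a vision-based setting the student would only approximate the teacher.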

Abstract

A dexterous hand capable of generalizable object grasping is fundamental for the development of general-purpose embodied AI. However, previous methods focus narrowly on low-level grasp stability metrics, neglecting affordance-aware positioning and human-like poses, which are crucial for downstream manipulation. To address these limitations, we propose AffordDex, a novel framework with two-stage training that learns a universal grasping policy with an inherent understanding of both motion priors and object affordances. In the first stage, a trajectory imitator is pre-trained on a large corpus of human hand motions to instill a strong prior for natural movement. In the second stage, a residual module is trained to adapt these general human-like motions to specific object instances. This refinement is guided by two components: our Negative Affordance-aware Segmentation (NAA) module, which identifies functionally inappropriate contact regions, and a privileged teacher-student distillation process that ensures the final vision-based policy is highly successful. Extensive experiments demonstrate that AffordDex not only achieves universal dexterous grasping but also remains remarkably human-like in posture and functionally appropriate in contact location. As a result, AffordDex significantly outperforms state-of-the-art baselines across seen objects, unseen instances, and even entirely novel categories.
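
To make the idea of "functionally inappropriate contact regions" concrete, the sketch below counts hand contacts that land near points labelled as negative affordance on an object point cloud. It is an illustration only: the function name `negative_affordance_penalty`, the `radius` threshold, and the counting rule are assumptions, and this section does not specify how AffordDex actually turns the NAA output into a training signal.

```python
import numpy as np

def negative_affordance_penalty(contact_points, object_points, negative_mask, radius=0.01):
    """Count contacts lying within `radius` of any negatively labelled point.

    contact_points: (C, 3) hand-object contact positions
    object_points:  (P, 3) object point cloud
    negative_mask:  (P,) boolean, True where touching is inappropriate
    """
    negative_points = object_points[negative_mask]
    if len(negative_points) == 0 or len(contact_points) == 0:
        return 0
    # Pairwise distances with shape (C, num_negative_points) via broadcasting.
    diffs = contact_points[:, None, :] - negative_points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # A contact is penalised if its nearest negative point is closer than radius.
    return int((dists.min(axis=1) < radius).sum())

# Toy "knife" laid along the x-axis: handle near x=0, blade toward x=0.2 (assumption).
xs = np.linspace(0.0, 0.2, 21)
knife = np.stack([xs, np.zeros_like(xs), np.zeros_like(xs)], axis=1)
blade_mask = knife[:, 0] > 0.105

contacts = np.array([[0.03, 0.002, 0.0],   # on the handle
                     [0.15, 0.002, 0.0]])  # on the blade
penalty = negative_affordance_penalty(contacts, knife, blade_mask)
```

In this toy setup only the second contact is close to a blade point, so a single contact is penalised.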

Paper Structure

This paper contains 22 sections, 12 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Performance comparison among UniDexGrasp [xu2023unidexgrasp], UniDexGrasp++ [wan2023unidexgrasp++], and our AffordDex in the vision-based setting. We report the human-likeness score (HLS) and affordance score (AS) across seen objects, unseen objects, and unseen categories. We also present a qualitative comparison, in which AffordDex grasps naturally and safely by avoiding the blade.
  • Figure 2: Pipeline of AffordDex. To generate grasps with affordance-aware positioning and human-like kinematics, both crucial for facilitating downstream manipulation, we propose a novel two-stage framework. The first stage establishes a strong human motion prior by training a base policy $\pi^H$ on a human motion dataset via imitation learning. This constrains the policy to a space of natural, human-like movements. Subsequently, the second stage employs reinforcement learning (RL) to refine this coarse policy $\pi^H$ for precise, functional interaction. We fine-tune $\pi^H$ with a residual module that is guided by our Negative Affordance-aware Segmentation (NAA) module, which provides explicit constraints on where not to touch the object. The entire learning pipeline is further enhanced by a teacher-student distillation framework, leveraging privileged inputs to significantly boost the final grasping performance.
  • Figure 3: Visualization of Negative Affordances Predicted by our NAA. The point cloud, highlighted in red, represents the negative affordances identified on various objects. These points denote regions that are functionally unsafe or inappropriate for grasping, such as a knife's blade.
  • Figure 4: Qualitative Comparison on UniDexGrasp [xu2023unidexgrasp] and OakInk2 [zhan2024oakink2]. A comparison of grasps generated by our AffordDex with several baselines, including UniDexGrasp [xu2023unidexgrasp], UniDexGrasp++ [wan2023unidexgrasp++], and DexGrasp Anything [zhong2025dexgrasp].
  • Figure 5: Ablation Study on Human Hand Trajectory Imitating (HTI). Without the human motion prior, the policy converges to a solution that, while potentially successful, is kinematically awkward and not human-like.
  • ...and 3 more figures
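
The second-stage refinement described in the Figure 2 caption, where a residual module adjusts the actions of the base policy $\pi^H$, can be sketched as a simple action composition. This is a hedged illustration rather than the paper's formulation: the function name `compose_action`, the `tanh` squashing, the `residual_scale` value, and the action limits are all assumptions chosen so that the correction stays small relative to the human-like base action.

```python
import numpy as np

def compose_action(base_action, residual, residual_scale=0.1, low=-1.0, high=1.0):
    """Add a bounded correction to the base policy's action.

    base_action:    action proposed by the pre-trained human-motion policy
    residual:       raw (unbounded) output of the residual module
    residual_scale: maximum magnitude of the correction per joint (assumption)
    """
    # tanh keeps each correction within [-residual_scale, residual_scale].
    correction = residual_scale * np.tanh(residual)
    # Clip so the final command respects the assumed actuator limits.
    return np.clip(base_action + correction, low, high)

base = np.array([0.5, -0.95])
raw_residual = np.array([0.0, -100.0])
action = compose_action(base, raw_residual)
```

Bounding the residual is one common way to keep the refined motion close to the prior; a zero residual leaves the base action unchanged.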