Towards Affordance-Aware Robotic Dexterous Grasping with Human-like Priors
Haoyu Zhao, Linghao Zhuang, Xingyue Zhao, Cheng Zeng, Haoran Xu, Yuming Jiang, Jun Cen, Kexiang Wang, Jiayan Guo, Siteng Huang, Xin Li, Deli Zhao, Hua Zou
TL;DR
This work tackles generalizable, safe, and human-like dexterous grasping by introducing AffordDex, a two-stage framework that marries strong human motion priors with affordance-aware refinement. The first stage pre-trains a human-motion policy $\pi^H$ via imitation, while the second stage learns a residual via PPO guided by the Negative Affordance-aware Segmentation $N_t$ and distills a vision-based policy $\pi^S$ through DAgger using privileged state information $S^T_t$. A novel NAA module leverages Vision-Language Models to identify non-touchable regions, producing a robust negative affordance map and enabling functionally correct grasps that remain human-like. Evaluations on UniDexGrasp and OakInk2 show state-of-the-art performance across seen, unseen, and novel categories, with improved naturalness and safer contact locations, highlighting strong generalization and practical potential for downstream manipulation and sim-to-real transfer.
Abstract
A dexterous hand capable of generalizable object grasping is fundamental to the development of general-purpose embodied AI. However, previous methods focus narrowly on low-level grasp stability metrics, neglecting affordance-aware positioning and human-like poses, both of which are crucial for downstream manipulation. To address these limitations, we propose AffordDex, a novel two-stage training framework that learns a universal grasping policy with an inherent understanding of both motion priors and object affordances. In the first stage, a trajectory imitator is pre-trained on a large corpus of human hand motions to instill a strong prior for natural movement. In the second stage, a residual module is trained to adapt these general human-like motions to specific object instances. This refinement is guided by two components: our Negative Affordance-aware Segmentation (NAA) module, which identifies functionally inappropriate contact regions, and a privileged teacher-student distillation process that ensures the final vision-based policy is highly successful. Extensive experiments demonstrate that AffordDex not only achieves universal dexterous grasping but also remains remarkably human-like in posture and functionally appropriate in contact location. As a result, AffordDex significantly outperforms state-of-the-art baselines across seen objects, unseen instances, and even entirely novel categories.
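The second-stage idea, a learned residual added on top of a human-prior action and discouraged from touching negative-affordance regions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the residual `scale`, the contact `radius`, and the count-based penalty are all assumptions chosen for illustration, and the negative-affordance mask is taken as given rather than produced by a Vision-Language Model.

```python
import numpy as np


def compose_action(base_action, residual, scale=0.1):
    """Add a bounded residual correction to the human-prior action.

    `scale` and the [-1, 1] clipping are illustrative assumptions.
    """
    return base_action + scale * np.clip(residual, -1.0, 1.0)


def negative_affordance_penalty(fingertips, points, negative_mask,
                                radius=0.01, weight=1.0):
    """Return a non-positive reward term for touching forbidden regions.

    fingertips:    (F, 3) fingertip positions
    points:        (P, 3) object point cloud
    negative_mask: (P,) boolean, True where contact should be avoided
    A fingertip counts as a violation when it lies within `radius` of
    any masked point (hypothetical rule, not taken from the paper).
    """
    forbidden = points[negative_mask]
    if forbidden.shape[0] == 0:
        return 0.0
    # Pairwise distances between fingertips and forbidden points: (F, P_neg)
    dists = np.linalg.norm(
        fingertips[:, None, :] - forbidden[None, :, :], axis=-1
    )
    violations = int((dists.min(axis=1) < radius).sum())
    return -weight * float(violations)
```

In a reinforcement learning loop, a term like `negative_affordance_penalty` would be added to the task reward, while `compose_action` keeps the final action close to the human-like prior because the residual is bounded.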
