AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance

Yi-Lin Wei, Mu Lin, Yuhao Lin, Jian-Jian Jiang, Xiao-Ming Wu, Ling-An Zeng, Wei-Shi Zheng

TL;DR

This work tackles open-set language-guided dexterous grasp by introducing AffordDexGrasp, which bridges language and high-DOF grasp actions through a Generalizable-Instructive Affordance. It couples two flow-matching models—Affordance Flow Matching and Grasp Flow Matching—with a pre-understanding stage based on a Multimodal Large Language Model, and augments performance with an affordance-guided pose optimization. The approach yields strong open-set generalization in both simulation and real-world tests, outperforming state-of-the-art methods in intention consistency and grasp quality while maintaining reasonable diversity. This framework enables robust, category-agnostic, language-conditioned dexterous manipulation, with potential extensions to complex manipulation tasks via integration with task planning and perception models.

Abstract

Language-guided robot dexterous generation enables robots to grasp and manipulate objects based on human commands. However, previous data-driven methods struggle to understand intention and to grasp objects from unseen categories in the open set. In this work, we explore a new task, Open-set Language-guided Dexterous Grasp, and find that the main challenge is the huge gap between high-level human language semantics and low-level robot actions. To solve this problem, we propose an Affordance Dexterous Grasp (AffordDexGrasp) framework, with the insight of bridging the gap with a new generalizable-instructive affordance representation. This affordance can generalize to unseen categories by leveraging the object's local structure and category-agnostic semantic attributes, thereby effectively guiding dexterous grasp generation. Built upon the affordance, our framework introduces Affordance Flow Matching (AFM), which generates affordance from language input, and Grasp Flow Matching (GFM), which generates dexterous grasps from affordance input. To evaluate our framework, we build an open-set table-top language-guided dexterous grasp dataset. Extensive experiments in simulation and in the real world show that our framework surpasses all previous methods in open-set generalization.
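
The abstract names two flow-matching models (AFM and GFM) but does not give their equations here. As a rough illustration of the general technique only, the sketch below assumes the common straight-line (rectified-flow style) formulation: a sample moves from noise `x0` to data `x1` along `x_t = (1 - t) * x0 + t * x1`, the network regresses the velocity `x1 - x0`, and inference integrates the learned velocity field with Euler steps. The names `predict_velocity`, `cond`, and `oracle` are hypothetical stand-ins, not taken from the paper; in the paper's setting `cond` would play the role of the language or affordance condition.

```python
# Minimal flow-matching sketch (assumed straight-line path; not the paper's exact model).

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t, cond):
    """Mean squared error between predicted and target velocity for one sample."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t, cond)
    v_star = target_velocity(x0, x1)
    return sum((p - s) ** 2 for p, s in zip(v_pred, v_star)) / len(v_star)


def euler_sample(predict_velocity, x0, cond, steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = predict_velocity(x, k * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle(x, t, cond):
    """Hypothetical stand-in for a trained network: the exact velocity field
    when the data distribution is a single point `cond`."""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]


if __name__ == "__main__":
    noise = [0.5, -1.0, 2.0]
    target = [1.0, 2.0, 3.0]
    print(euler_sample(oracle, noise, target, steps=10))
```

With the oracle field, the training loss is zero up to rounding and sampling lands on the target point; a trained network would replace `oracle` and take the conditioning signal as an extra input.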

Paper Structure

This paper contains 40 sections, 18 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Open-set Language-guided Dexterous Grasp. Our framework bridges the gap between language and grasp actions through Generalizable-Instructive Affordance, which enables cross-category generalization via category-agnostic cues and graspable local structure. Remarkably, our framework demonstrates strong generalization without requiring extra real training data in real-world experiments.
  • Figure 2: Different affordance representations. (a) While contact maps are too elaborate to generalize and object parts are too coarse to guide grasping, our affordance achieves a balance. (b) While object parts have a lower upper bound and contact maps show significant degradation in generalization, only our affordance effectively achieves the balance (Top-1 indicates grasp intention consistency).
  • Figure 3: The pipeline of the Affordance Dexterous Grasp framework. The inference pipeline includes three stages: 1) intention pre-understanding assisted by an MLLM; 2) Affordance Flow Matching for generating affordance based on the MLLM output; 3) Grasp Flow Matching and Optimization for outputting grasp poses based on the affordance and MLLM outputs. During training, AFM and GFM are trained independently, one after another. Transformer and Perceiver are attention-based interaction modules for velocity vector field prediction.
  • Figure 4: Visualization of generated affordance and dexterous grasps. The top left shows zero-shot samples and the bottom left shows one-shot samples in the real world. The top right and bottom right show zero-shot samples in simulation open sets A and B.
  • Figure 5: The training pipeline of Affordance Flow Matching and Grasp Flow Matching.
  • ...and 10 more figures