ZeroDexGrasp: Zero-Shot Task-Oriented Dexterous Grasp Synthesis with Prompt-Based Multi-Stage Semantic Reasoning

Juntao Jian, Yi-Lin Wei, Chengjie Mou, Yuhao Lin, Xing Zhu, Yujun Shen, Wei-Shi Zheng, Ruizhen Hu

TL;DR

The paper tackles zero-shot task-oriented dexterous grasping across diverse objects and task instructions. It proposes ZeroDexGrasp, a two-component pipeline that combines prompt-based multi-stage semantic reasoning with contact-guided grasp refinement to synthesize grasp parameters $G = \{T, R, \theta\}$, where $T \in \mathbb{R}^3$ is the hand translation, $R \in SO(3)$ is the hand rotation, and $\theta \in \mathbb{R}^{16}$ is the vector of joint angles. Key contributions include Set-of-Mark-based part-level contact inference, imagination-driven initial hand rotation inference, geometry-guided verification, and an energy-based refinement framework that ties semantic requirements to physically feasible grasps. Experiments show strong zero-shot generalization to open-set objects and complex tasks, with improved semantic alignment and reduced interpenetration, and the method is also demonstrated on real robotic hardware. The authors acknowledge that the pipeline depends on MLLM reliability and identify more robust grounding and feedback as future work toward better real-world robustness.
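To make the parameterization $G = \{T, R, \theta\}$ concrete, the following is a minimal sketch of a container that checks the three constraints stated above. The names `Grasp` and `is_rotation_matrix` are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Sequence


def is_rotation_matrix(R: Sequence[Sequence[float]], tol: float = 1e-6) -> bool:
    """Return True if R is a 3x3 matrix with R^T R = I and det(R) = +1."""
    if len(R) != 3 or any(len(row) != 3 for row in R):
        return False
    # Orthonormality: the columns of R must form an orthonormal basis.
    for i in range(3):
        for j in range(3):
            dot = sum(R[k][i] * R[k][j] for k in range(3))
            target = 1.0 if i == j else 0.0
            if abs(dot - target) > tol:
                return False
    # A determinant of +1 excludes reflections, which are orthogonal but not in SO(3).
    det = (
        R[0][0] * (R[1][1] * R[2][2] - R[1][2] * R[2][1])
        - R[0][1] * (R[1][0] * R[2][2] - R[1][2] * R[2][0])
        + R[0][2] * (R[1][0] * R[2][1] - R[1][1] * R[2][0])
    )
    return abs(det - 1.0) <= tol


@dataclass
class Grasp:
    """Hypothetical container for G = {T, R, theta}."""
    T: Sequence[float]            # hand translation, 3 values
    R: Sequence[Sequence[float]]  # hand rotation, 3x3 matrix in SO(3)
    theta: Sequence[float]        # joint angles, 16 values

    def __post_init__(self) -> None:
        if len(self.T) != 3:
            raise ValueError("T must have 3 components")
        if not is_rotation_matrix(self.R):
            raise ValueError("R must be a valid rotation matrix in SO(3)")
        if len(self.theta) != 16:
            raise ValueError("theta must have 16 joint angles")


# Usage: identity orientation, hand 10 cm above the origin, all joints at zero.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
grasp = Grasp(T=[0.0, 0.0, 0.1], R=identity, theta=[0.0] * 16)
```

Validating at construction time means a malformed pose, for example a reflection passed in place of a rotation, fails early instead of surfacing later during refinement.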

Abstract

Task-oriented dexterous grasping holds broad application prospects in robotic manipulation and human-object interaction. However, most existing methods still struggle to generalize across diverse objects and task instructions, as they heavily rely on costly labeled data to ensure task-specific semantic alignment. In this study, we propose ZeroDexGrasp, a zero-shot task-oriented dexterous grasp synthesis framework integrating Multimodal Large Language Models with grasp refinement to generate human-like grasp poses that are well aligned with specific task objectives and object affordances. Specifically, ZeroDexGrasp employs prompt-based multi-stage semantic reasoning to infer initial grasp configurations and object contact information from task and object semantics, then exploits contact-guided grasp optimization to refine these poses for physical feasibility and task alignment. Experimental results demonstrate that ZeroDexGrasp enables high-quality zero-shot dexterous grasping on diverse unseen object categories and complex task requirements, advancing toward more generalizable and intelligent robotic grasping.

Paper Structure

This paper contains 17 sections, 10 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Given task instructions and target objects, ZeroDexGrasp synthesizes dexterous grasps in a zero-shot manner by reasoning about contact information, hand position, orientation, and grasp type, followed by contact-guided refinement to ensure semantic alignment and physical feasibility.
  • Figure 2: Overall pipeline. ZeroDexGrasp consists of two main components. The first is prompt-based multi-stage semantic reasoning, comprising three steps: (1) contact information inference, (2) grasp type and initial hand position inference, and (3) hand rotation inference. The second component, shown as part (4) in the figure, is contact-guided grasp optimization based on contact priors and the initial hand pose.
  • Figure 3: Pipeline of part-level contact inference. Semantically aligned 2D part-level contact regions are identified and back-projected to 2.5D; the final 3D part-level contact region is then inferred via feature clustering and classification.
  • Figure 4: Illustration of Geometry-Guided Verification. (a) Hand rotation filtering via local surface normals, where the red point denotes the nearest surface point and the blue points indicate its neighbors. (b) Point-level contact validation using force-normal consistency.
  • Figure 5: Qualitative results of our method. ZeroDexGrasp achieves zero-shot, semantically aligned, and physically feasible grasps for diverse objects and tasks. The top-right corner shows the inferred part-level contact region (blue) and the point-level contacts for the index finger (red) and the thumb (green).
  • ...and 2 more figures
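
The back-projection step mentioned in the Figure 3 caption can be illustrated with a standard pinhole-camera sketch: each masked pixel with a valid depth is lifted to a 3D point in the camera frame. This is an assumption about how a 2D contact mask might be lifted to 2.5D, not the authors' implementation; the function name `backproject_mask` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def backproject_mask(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3]:
    """Lift masked pixels to camera-frame 3D points with a pinhole model.

    depth[v][u] is the depth at row v, column u; values <= 0 are treated
    as missing measurements and skipped.
    """
    points: List[Point3] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, selected) in enumerate(zip(depth_row, mask_row)):
            if not selected or z <= 0.0:
                continue
            # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Usage: a 2x2 depth image with one missing depth value and one unmasked pixel.
depth_image = [[1.0, 2.0], [0.0, 4.0]]
contact_mask = [[True, False], [True, True]]
contact_points = backproject_mask(depth_image, contact_mask, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

The result covers only the surface visible from one viewpoint, which is consistent with the caption's description of a further clustering and classification step to obtain the full 3D contact region.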