ZeroDexGrasp: Zero-Shot Task-Oriented Dexterous Grasp Synthesis with Prompt-Based Multi-Stage Semantic Reasoning
Juntao Jian, Yi-Lin Wei, Chengjie Mou, Yuhao Lin, Xing Zhu, Yujun Shen, Wei-Shi Zheng, Ruizhen Hu
TL;DR
The paper tackles zero-shot task-oriented dexterous grasping across diverse objects and task instructions. It proposes ZeroDexGrasp, a two-component pipeline that combines prompt-based multi-stage semantic reasoning with contact-guided grasp refinement to synthesize grasp parameters $G = \{T, R, \theta\}$, where $T \in \mathbb{R}^3$, $R \in SO(3)$, and $\theta \in \mathbb{R}^{16}$. Key contributions include Set-of-Mark-based part-level contact inference, imagination-driven inference of the initial hand rotation, geometry-guided verification, and an energy-based refinement framework that ties semantic requirements to physically feasible grasps. Experiments show strong zero-shot generalization to open-set objects and complex tasks, with improved semantic alignment and reduced interpenetration, and the method is also demonstrated on real robotic hardware. The authors note that MLLM reliability remains a limitation and point to more robust grounding and feedback as future work toward better real-world robustness.
Abstract
Task-oriented dexterous grasping holds broad application prospects in robotic manipulation and human-object interaction. However, most existing methods still struggle to generalize across diverse objects and task instructions, as they heavily rely on costly labeled data to ensure task-specific semantic alignment. In this study, we propose ZeroDexGrasp, a zero-shot task-oriented dexterous grasp synthesis framework integrating Multimodal Large Language Models with grasp refinement to generate human-like grasp poses that are well aligned with specific task objectives and object affordances. Specifically, ZeroDexGrasp employs prompt-based multi-stage semantic reasoning to infer initial grasp configurations and object contact information from task and object semantics, then exploits contact-guided grasp optimization to refine these poses for physical feasibility and task alignment. Experimental results demonstrate that ZeroDexGrasp enables high-quality zero-shot dexterous grasping on diverse unseen object categories and complex task requirements, advancing toward more generalizable and intelligent robotic grasping.
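To make the grasp configuration concrete, the parameterization $G = \{T, R, \theta\}$ with $T \in \mathbb{R}^3$, $R \in SO(3)$, and $\theta \in \mathbb{R}^{16}$ can be sketched as a small data structure. This is only an illustrative sketch, not the authors' implementation: the names `GraspParams` and `is_rotation_matrix` are hypothetical, and reading $T$ as the hand translation and $\theta$ as 16 hand joint angles is an assumption, since the text above only states the spaces these quantities live in.

```python
from dataclasses import dataclass

import numpy as np


def is_rotation_matrix(R, tol=1e-6):
    """Return True if R is a 3x3 matrix in SO(3): orthonormal with det +1."""
    R = np.asarray(R, dtype=float)
    if R.shape != (3, 3):
        return False
    orthonormal = np.allclose(R.T @ R, np.eye(3), atol=tol)
    proper = np.isclose(np.linalg.det(R), 1.0, atol=tol)
    return bool(orthonormal and proper)


@dataclass
class GraspParams:
    """Hypothetical container for G = {T, R, theta} (names are illustrative)."""

    translation: np.ndarray   # T in R^3 (assumed: hand translation)
    rotation: np.ndarray      # R in SO(3), stored as a 3x3 matrix
    joint_angles: np.ndarray  # theta in R^16 (assumed: hand joint angles)

    def __post_init__(self):
        # Normalize inputs to float arrays, then check each shape constraint.
        self.translation = np.asarray(self.translation, dtype=float)
        self.rotation = np.asarray(self.rotation, dtype=float)
        self.joint_angles = np.asarray(self.joint_angles, dtype=float)
        if self.translation.shape != (3,):
            raise ValueError("T must have shape (3,)")
        if not is_rotation_matrix(self.rotation):
            raise ValueError("R must be a 3x3 rotation matrix in SO(3)")
        if self.joint_angles.shape != (16,):
            raise ValueError("theta must have shape (16,)")
```

A grasp at the origin with identity orientation and all joints at zero would be built as `GraspParams(np.zeros(3), np.eye(3), np.zeros(16))`. Validating $R$ at construction is a design choice of this sketch, so that a refinement step which drifts off $SO(3)$ fails early rather than silently.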
