Grasp What You Want: Embodied Dexterous Grasping System Driven by Your Voice
Junliang Li, Kai Ye, Haolan Kang, Mingxuan Liang, Yuhang Wu, Zhenhua Liu, Huiping Zhuang, Rui Huang, Yongquan Chen
TL;DR
This work addresses voice-guided dexterous grasping in cluttered environments for human-robot collaboration. It introduces EDGS, a modular framework with three stages that together translate voice commands into robust grasps: Enriched Representation Guided Segmentation (ERGS), which builds on Referring Expression Representation Enrichment (RERE); Dexterous Grasp Candidates Generation (DGCG); and Dexterous Grasp Refinement (DGR). RERE enriches referring expressions through cross-modal alignment with a Vision-Language Model, while DGCG and DGR use skeleton-based features, friction-aware evaluation, and STOMP-based trajectory optimization to generate and refine stable grasps. Extensive real-world experiments and application scenarios show high grasp success rates across diverse objects and cluttered scenes, supporting the practical viability of voice-driven embodied dexterous manipulation.
Abstract
As robotics has advanced, human-robot collaboration has become increasingly important. However, current robots struggle to interpret human intentions fully and accurately from voice commands alone. Traditional gripper and suction systems often fail to interact naturally with humans, lack advanced manipulation capabilities, and adapt poorly to diverse tasks, especially in unstructured environments. This paper introduces the Embodied Dexterous Grasping System (EDGS), designed for object grasping in cluttered environments during human-robot interaction. We propose a novel approach to semantic-object alignment that uses a Vision-Language Model (VLM) to fuse voice commands with visual information, significantly improving the alignment of multi-dimensional attributes of target objects in complex scenes. Inspired by human hand-object interaction, we develop a robust, precise, and efficient grasping strategy that incorporates principles such as the thumb-object axis, multi-finger wrapping, and fingertip interaction governed by the object's contact mechanics. We also design experiments to assess Referring Expression Representation Enrichment (RERE) in referring expression segmentation, showing that our system accurately detects and matches referring expressions. Extensive experiments confirm that EDGS handles complex grasping tasks with stable grasps and high success rates, highlighting its potential for further development in Embodied AI.
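To make the notion of friction-aware fingertip contact concrete, the following Python sketch tests whether a single contact force lies inside a Coulomb friction cone. It is a generic textbook check offered for illustration, not the EDGS implementation: the function name `within_friction_cone`, the point-contact model, and the friction coefficient used in the example are all assumptions.

```python
import math


def within_friction_cone(force, normal, mu):
    """Return True if `force` lies inside the Coulomb friction cone.

    force  -- 3D force applied by the fingertip on the object
    normal -- 3D inward surface normal at the contact (need not be unit length)
    mu     -- assumed friction coefficient (hypothetical value, not from the paper)
    """
    n_len = math.sqrt(sum(c * c for c in normal))
    if n_len == 0.0:
        raise ValueError("normal must be non-zero")
    n = [c / n_len for c in normal]

    # Component of the force along the inward normal (pressing into the object).
    f_n = sum(f * c for f, c in zip(force, n))
    if f_n <= 0.0:
        return False  # a fingertip can push on the surface but not pull on it

    # Remaining tangential component, which friction must resist.
    tangential = [f - f_n * c for f, c in zip(force, n)]
    f_t = math.sqrt(sum(c * c for c in tangential))

    # Coulomb condition: the contact does not slip while |f_t| <= mu * f_n.
    return f_t <= mu * f_n
```

For example, with an assumed `mu = 0.5`, a force of `(0.3, 0.0, 1.0)` against the normal `(0.0, 0.0, 1.0)` stays inside the cone, while `(1.0, 0.0, 1.0)` would slip. A multi-finger evaluation could apply such a test per contact, although how EDGS scores candidates is not specified in this section.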
