Grasp What You Want: Embodied Dexterous Grasping System Driven by Your Voice

Junliang Li, Kai Ye, Haolan Kang, Mingxuan Liang, Yuhang Wu, Zhenhua Liu, Huiping Zhuang, Rui Huang, Yongquan Chen

TL;DR

This work tackles the problem of human-robot collaboration for dexterous grasping under natural language guidance in cluttered environments. It introduces EDGS, a modular framework combining Enriched Representation Guided Segmentation (ERGS) via Referring Expression Representation Enrichment (RERE), Dexterous Grasp Candidates Generation (DGCG), and Dexterous Grasp Refinement (DGR) to translate voice commands into robust grasps. RERE enriches referring expressions by cross-modal alignment with a Vision-Language Model, while DGCG and DGR leverage skeleton-based features, friction-aware evaluation, and STOMP-based trajectory optimization to produce and refine grasps with high stability. Extensive real-world experiments and application scenarios demonstrate high grasp success rates across diverse objects and clutter, underscoring the practical viability and potential of voice-driven embodied dexterous manipulation for real-world deployment.

Abstract

In recent years, as robotics has advanced, human-robot collaboration has gained increasing importance. However, current robots struggle to fully and accurately interpret human intentions from voice commands alone. Traditional gripper and suction systems often fail to interact naturally with humans, lack advanced manipulation capabilities, and are not adaptable to diverse tasks, especially in unstructured environments. This paper introduces the Embodied Dexterous Grasping System (EDGS), designed to tackle object grasping in cluttered environments for human-robot interaction. We propose a novel approach to semantic-object alignment using a Vision-Language Model (VLM) that fuses voice commands and visual information, significantly enhancing the alignment of multi-dimensional attributes of target objects in complex scenarios. Inspired by human hand-object interactions, we develop a robust, precise, and efficient grasping strategy, incorporating principles like the thumb-object axis, multi-finger wrapping, and fingertip interaction with an object's contact mechanics. We also design experiments to assess Referring Expression Representation Enrichment (RERE) in referring expression segmentation, demonstrating that our system accurately detects and matches referring expressions. Extensive experiments confirm that EDGS can effectively handle complex grasping tasks, achieving stability and high success rates, highlighting its potential for further development in the field of Embodied AI.

Paper Structure

This paper contains 23 sections, 11 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of Our Method. The system processes natural language instructions via a speech recognition module and RGB-D scene data with a vision-language model to generate a rich object description. A segmentation model isolates the target object, creating a segmented point cloud. The policy generation module then computes grasp strategies, which are executed by a robotic arm with proprioceptive feedback for precise manipulation.
  • Figure 2: Grasp Policy Generation Method. Our method starts with a segmented point cloud for feature extraction, followed by constrained sampling and contact point estimation. The constrained-sampling parameter ($K_o$) is determined by grasp affordance assessment. A GPT-aided module estimates the friction coefficient ($\mu$) for force closure filtering, and the grasp action sets are refined through GWS quality assessment to determine the best 12D action.
  • Figure 3: Overview of the Experimental Setup.
  • Figure 4: Segmentation Error Analysis in Grasping Scenarios: Comparison of Results with and without RERE. This figure shows four common segmentation errors in grasping tasks: (a) Class confusion, (b) Boundary inaccuracies, (c) Object merging, and (d) False negatives.
  • Figure 5: Grasping Scenarios for Eleven Objects. Experimental setups showing the system's performance across diverse objects.
  • ...and 1 more figure
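
The Figure 2 caption mentions force closure filtering driven by an estimated friction coefficient ($\mu$). As a rough illustration of how such a filter can use $\mu$, the sketch below applies the classical two-contact (antipodal) test: a grasp passes if the line joining the two contacts lies inside both friction cones, whose half-angle is $\arctan(\mu)$. This is a simplified stand-in, not the paper's multi-finger method; the function name `antipodal_force_closure`, its arguments, and the inward-normal convention are all assumptions made for this example.

```python
import math


def _unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(c / n for c in v)


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def antipodal_force_closure(p1, n1, p2, n2, mu):
    """Hypothetical two-contact filter (not the paper's implementation).

    p1, p2: contact positions; n1, n2: surface normals pointing INTO the
    object (assumed convention); mu: Coulomb friction coefficient.
    Returns True if the segment p1 -> p2 lies inside both friction cones.
    """
    half_angle = math.atan(mu)  # friction cone half-angle
    line = _unit(tuple(b - a for a, b in zip(p1, p2)))  # direction p1 -> p2
    n1, n2 = _unit(n1), _unit(n2)
    # Angle between each inward normal and the line toward the other contact.
    ang1 = math.acos(max(-1.0, min(1.0, _dot(n1, line))))
    ang2 = math.acos(max(-1.0, min(1.0, -_dot(n2, line))))
    return ang1 <= half_angle and ang2 <= half_angle


if __name__ == "__main__":
    left, right = (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)
    # Opposing normals aligned with the contact line: passes for any mu > 0.
    print(antipodal_force_closure(left, (1, 0, 0), right, (-1, 0, 0), mu=0.5))
    # A normal tilted 45 degrees needs a wider cone (larger mu) to pass.
    print(antipodal_force_closure(left, (1, 1, 0), right, (-1, 0, 0), mu=0.5))
    print(antipodal_force_closure(left, (1, 1, 0), right, (-1, 0, 0), mu=1.2))
```

In a filter of this kind, a larger $\mu$ widens the cones and admits more candidate grasps, which is why the quality of the friction estimate matters for which candidates survive.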