LangGrasp: Leveraging Fine-Tuned LLMs for Language Interactive Robot Grasping with Ambiguous Instructions

Yunhan Lin, Wenqi Wu, Zhijie Zhang, Huasong Min

TL;DR

LangGrasp tackles ambiguity in language-driven robotic grasping by fine-tuning LLMs to produce structured, executable action sequences from complex instructions. It introduces a perception-and-inference module, a part-aware point cloud localization pipeline guided by 2D segmentation, and a flexible grasp pose detection component to achieve fine-grained, part-level manipulation. Experimental results show that fine-tuning improves semantic understanding, structured output, and inference granularity, while the expansion-based localization approach enhances grasp quality and reduces collisions in both desktop and cabinet scenes. The framework demonstrates real-world applicability with robust performance on simple and ordinary instructions, and points to future work on multi-object and dynamic scenarios.

Abstract

Existing language-driven grasping methods struggle to handle ambiguous instructions that contain implicit intents. To tackle this challenge, we propose LangGrasp, a novel language-interactive robotic grasping framework. The framework integrates fine-tuned large language models (LLMs) to leverage their robust commonsense understanding and environmental perception capabilities, deducing implicit intents from language instructions and clarifying both the task requirements and the target objects to manipulate. Furthermore, our point cloud localization module, guided by 2D part segmentation, localizes partial point clouds within a scene, extending grasping from coarse-grained object-level manipulation to fine-grained part-level manipulation. Experimental results show that the LangGrasp framework accurately resolves implicit intents in ambiguous instructions, identifying critical operations and target information that are unstated yet essential for task completion. Additionally, it dynamically selects optimal grasping poses by integrating environmental information. This enables high-precision grasping at both the object level and the part level, significantly enhancing the adaptability and task execution efficiency of robots in unstructured environments. More information and code are available here: https://github.com/wu467/LangGrasp.
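As an illustration of how a 2D part segmentation can guide partial point cloud localization, the sketch below back-projects only the masked depth pixels into 3D using a standard pinhole camera model. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `mask_to_points`, the nested-list image layout, and the intrinsics `fx`, `fy`, `cx`, `cy` are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def mask_to_points(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project masked depth pixels into camera-frame 3D points.

    Assumes a pinhole camera: pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            # Skip pixels outside the part mask and invalid depth readings.
            if not keep or z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth image (metres) with a mask that selects the diagonal pixels.
toy_depth = [[1.0, 1.0], [1.0, 1.0]]
toy_mask = [[True, False], [False, True]]
part_points = mask_to_points(toy_depth, toy_mask, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Only pixels inside the mask contribute points, so the resulting cloud covers the segmented part rather than the whole object, which is the idea behind moving from object-level to part-level grasping.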

Paper Structure

This paper contains 15 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The overall framework of our proposed method. The LangGrasp framework primarily consists of three modules: Perception and Inference, Point Cloud Localization, and Grasp Pose Detection. Given an RGB image, a depth image, and language instructions, the framework outputs optimal 6-DoF grasp poses.
  • Figure 2: The procedure of LangGrasp consists of three stages: 1. the Perception and Inference stage, where structured reasoning results are generated based on the current scene and multi-turn dialogue information; 2. the Point Cloud Localization stage, where the target object's point cloud is localized in the global point cloud using the semantic information generated in the previous stage, and the target point cloud region is optimized through an expansion strategy; 3. the Grasp Pose Detection stage, where 6-DoF grasp poses are predicted for the local point cloud, and the pose with the highest score is sent to the robotic arm for grasp execution.
  • Figure 3: Experimental objects and platforms. (a) Objects dataset. (b) Desktop experimental scene. (c) Cabinet experimental scene.
  • Figure 4: Collection and processing of the fine-tuning dataset.
  • Figure 5: GPT-4o online fine-tuning process. A total of 100 training epochs were conducted, with the training loss rapidly decreasing in the early stages before stabilizing, indicating effective convergence and no significant overfitting.
  • ...and 1 more figure
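
The last two stages described in the Figure 2 caption (expanding the target region, then sending the highest-scoring pose to the arm) can be sketched roughly as follows. The captions do not specify how the expansion strategy works, so the bounding-box growth used here is purely an assumption for illustration; `GraspCandidate`, `expand_region`, `best_grasp`, and the `margin` parameter are hypothetical names, and the pose is reduced to a position plus a score for brevity.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class GraspCandidate:
    """Hypothetical stand-in for a scored grasp pose (rotation omitted)."""
    position: Point
    score: float


def expand_region(scene: Sequence[Point], target: Sequence[Point], margin: float) -> List[Point]:
    """Keep scene points inside the target's axis-aligned bounding box grown by `margin`.

    Assumption: this is one plausible form of region expansion, not the paper's method.
    """
    if not target:
        return []
    lo = [min(p[i] for p in target) - margin for i in range(3)]
    hi = [max(p[i] for p in target) + margin for i in range(3)]
    return [p for p in scene if all(lo[i] <= p[i] <= hi[i] for i in range(3))]


def best_grasp(candidates: Sequence[GraspCandidate]) -> GraspCandidate:
    """Return the candidate with the highest score."""
    return max(candidates, key=lambda g: g.score)


# Toy data: one target point, one nearby context point, one distant point.
scene_points = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (1.0, 1.0, 1.0)]
target_points = [(0.0, 0.0, 0.0)]
local_cloud = expand_region(scene_points, target_points, margin=0.1)

candidates = [
    GraspCandidate(position=(0.0, 0.0, 0.0), score=0.2),
    GraspCandidate(position=(0.05, 0.0, 0.0), score=0.9),
]
chosen = best_grasp(candidates)
```

Including some surrounding context in the local cloud gives a grasp detector information about nearby geometry, which is consistent with the TL;DR's claim that expansion-based localization reduces collisions, although the exact mechanism in the paper may differ.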