Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance

Toan Nguyen; Minh Nhat Vu; Baoru Huang; An Vuong; Quan Vuong; Ngan Le; Thieu Vo; Anh Nguyen

TL;DR

This work tackles language-driven 6-DoF grasp detection in cluttered 3D point clouds by introducing the Grasp-Anything-6D dataset (1M scenes with ~200M dense grasps) and a diffusion-based model, LGrasp6D, that leverages negative prompt guidance to steer grasps toward a target object specified by natural language. The model fuses a CLIP-based text cue, a PointNet++ scene encoder, and an $\text{se}(3)$ grasp representation within a diffusion framework, training with a combined noise-prediction and negative-prompt loss and performing sampling with a compositional denoising scheme. Empirical results on the Grasp-Anything-6D dataset and cross-dataset generalization show superior performance over baselines, and real-world robot experiments with a KUKA arm validate practical viability, illustrating the method’s potential for intuitive human–robot collaboration in cluttered environments.
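
To make the idea of a compositional denoising scheme concrete, the sketch below combines three noise estimates at one sampling step: one conditioned on the scene only, one on the scene plus the target prompt, and one on the scene plus the negative prompt. The function name `compose_noise`, the guidance weight `w`, and the specific combination rule are assumptions chosen for illustration in the style of common guidance formulas; the paper's exact sampling rule may differ.

```python
from typing import List, Sequence


def compose_noise(
    eps_scene: Sequence[float],
    eps_pos: Sequence[float],
    eps_neg: Sequence[float],
    w: float,
) -> List[float]:
    """Combine three noise estimates into one guided estimate.

    eps_scene: estimate conditioned on the scene only (hypothetical input).
    eps_pos:   estimate conditioned on the scene and the target prompt.
    eps_neg:   estimate conditioned on the scene and the negative prompt.
    w:         guidance weight (assumed hyperparameter).

    Assumed rule: push the scene-only estimate along the direction that
    separates the positive-prompt estimate from the negative-prompt one.
    """
    if not (len(eps_scene) == len(eps_pos) == len(eps_neg)):
        raise ValueError("all noise estimates must have the same length")
    return [s + w * (p - n) for s, p, n in zip(eps_scene, eps_pos, eps_neg)]


# With w = 0 the guided estimate reduces to the scene-only estimate.
guided = compose_noise([0.0, 0.0], [1.0, 2.0], [0.5, 1.0], w=2.0)
```

In this form, a larger `w` moves the estimate further toward the described object and away from the unwanted ones, and `w = 0` recovers sampling that ignores the language input.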

Abstract

6-DoF grasp detection has been a fundamental and challenging problem in robotic vision. While previous works have focused on ensuring grasp stability, they often do not consider human intention conveyed through natural language, hindering effective collaboration between robots and users in complex 3D environments. In this paper, we present a new approach for language-driven 6-DoF grasp detection in cluttered point clouds. We first introduce Grasp-Anything-6D, a large-scale dataset for the language-driven 6-DoF grasp detection task with 1M point cloud scenes and more than 200M language-associated 3D grasp poses. We further introduce a novel diffusion model that incorporates a new negative prompt guidance learning strategy. The proposed negative prompt strategy directs the detection process toward the desired object while steering away from unwanted ones given the language input. Our method enables an end-to-end framework where humans can command the robot to grasp desired objects in a cluttered scene using natural language. Intensive experimental results show the effectiveness of our method in both benchmarking experiments and real-world scenarios, surpassing other baselines. In addition, we demonstrate the practicality of our approach in real-world robotic applications. Our project is available at https://airvlab.github.io/grasp-anything.

Paper Structure (25 sections, 1 theorem, 13 equations, 13 figures, 7 tables)

Introduction
Related Works
The Grasp-Anything-6D Dataset
Grasp Detection using Negative Prompt Guidance
Motivation
Language-Driven 6-DoF Grasp Detection
Training and Sampling
Experiments
Language-Driven 6-DoF Grasp Detection Results
Generalization Analysis
Negative Prompt Guidance Analysis
Robotics Experiment
Discussion
Conclusion
Theoretical Findings
...and 10 more sections

Key Result

Proposition

The conditional distribution $p\left ( \mathbf{g} | \mathbf{S},\mathbf{t},\neg\tilde{\mathbf{t}}\right )$ can be factorized as

Figures (13)

Figure 1: We tackle the task of language-driven 6-DoF grasp detection in cluttered 3D point cloud scenes.
Figure 2: Overview of Grasp-Anything-6D dataset construction pipeline.
Figure 3: Overview of our denoising network. In addition to predicting the noise, our denoising network is trained to learn the negative prompt embedding, which is supervised by the text embeddings associated with other unwanted objects in the same scene.
Figure 4: Language-driven 6-DoF grasp detection qualitative results.
Figure 5: In the wild language-driven 6-DoF grasp detection results.
...and 8 more figures
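
The Figure 3 caption describes a denoising network with two outputs: a noise prediction and a learned negative prompt embedding supervised by the text embeddings of the other objects in the scene. A minimal sketch of a two-term objective matching that description follows. The mean-squared-error distance, the averaging over unwanted objects, the weight `lam`, and all function names are assumptions made for illustration, not the paper's stated loss.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(
    eps_pred: Sequence[float],
    eps_true: Sequence[float],
    neg_pred: Sequence[float],
    neg_targets: Sequence[Sequence[float]],
    lam: float = 1.0,
) -> float:
    """Noise-prediction term plus a weighted negative-prompt term.

    eps_pred / eps_true: predicted and sampled noise for the grasp pose.
    neg_pred:            predicted negative prompt embedding.
    neg_targets:         text embeddings of the unwanted objects in the
                         same scene (hypothetical, precomputed inputs).
    lam:                 weight of the negative-prompt term (assumed).
    """
    noise_term = mse(eps_pred, eps_true)
    if not neg_targets:
        # A scene with no other objects contributes no negative term.
        return noise_term
    neg_term = sum(mse(neg_pred, t) for t in neg_targets) / len(neg_targets)
    return noise_term + lam * neg_term


loss = combined_loss(
    eps_pred=[1.0, 0.0],
    eps_true=[0.0, 0.0],
    neg_pred=[1.0, 1.0],
    neg_targets=[[1.0, 1.0], [0.0, 1.0]],
    lam=2.0,
)
```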

Theorems & Definitions (4)

Proposition
Proof
Remark
Proof
