Lightweight Language-driven Grasp Detection using Conditional Consistency Model

Nghia Nguyen, Minh Nhat Vu, Baoru Huang, An Vuong, Ngan Le, Thieu Vo, Anh Nguyen

TL;DR

This work tackles language-driven grasp detection under real-time constraints by introducing Lightweight Language-driven Grasp Detection (LLGD), a system that combines continuous diffusion with a conditional consistency model conditioned on image-text features. Using ALBEF fusion to encode visuals and language, LLGD trains a score network and a conditional consistency model to enable fast, few-step inference while maintaining high accuracy. Empirical results on Grasp-Anything and real-robot trials show that LLGD outperforms traditional grasp detectors and other lightweight diffusion methods, offering competitive latency and strong zero-shot generalization. The approach makes language-guided robotic grasping more practical for cluttered and real-world environments, with potential extensions to 3D perception and more complex linguistic instructions.

Abstract

Language-driven grasp detection is a fundamental yet challenging task in robotics with various industrial applications. In this work, we present a new approach for language-driven grasp detection that leverages lightweight diffusion models to achieve fast inference. By integrating diffusion processes with grasping prompts in natural language, our method can effectively encode visual and textual information, enabling more accurate and versatile grasp positioning that aligns well with the text query. To overcome the long inference time of diffusion models, we use the image and text features as the condition in the consistency model, which reduces the number of denoising timesteps during inference. Extensive experimental results show that our method outperforms other recent grasp detection methods and lightweight diffusion models by a clear margin. We further validate our method in real-world robotic experiments to demonstrate its fast inference capability.
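
The few-step inference idea in the abstract can be illustrated with a minimal sketch of multistep consistency sampling. This is not the authors' implementation: the noise levels `EPS` and `T_MAX`, the data scale `SIGMA_DATA`, the 5-parameter grasp pose (x, y, width, height, angle), and the function `toy_consistency_fn` are all assumptions made for illustration. In particular, `toy_consistency_fn` is a hypothetical stand-in for a trained conditional consistency model; it simply reads a target pose from the first five entries of the condition vector, whereas a real model would predict the pose from fused image-text features.

```python
import numpy as np

EPS = 0.002        # smallest noise level (assumed value)
T_MAX = 80.0       # largest noise level (assumed value)
SIGMA_DATA = 0.5   # assumed data scale used in the skip weight


def c_skip(t):
    """Skip weight that equals 1 at t == EPS, enforcing f(x, EPS) = x."""
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def toy_consistency_fn(x, t, cond):
    """Hypothetical stand-in for a trained conditional consistency model.

    Blends the noisy pose x with a target read from the condition vector.
    A real model would replace the target with a network prediction.
    """
    target = cond[:5]
    w = c_skip(t)
    return w * x + (1.0 - w) * target


def sample_grasp(cond, taus, rng):
    """Multistep consistency sampling: one jump from pure noise, then a few
    re-noise and denoise refinements at decreasing noise levels."""
    x = T_MAX * rng.standard_normal(5)          # start from pure noise
    x = toy_consistency_fn(x, T_MAX, cond)      # single-step estimate
    for tau in taus:
        z = rng.standard_normal(5)
        x_tau = x + np.sqrt(tau**2 - EPS**2) * z  # re-noise to level tau
        x = toy_consistency_fn(x_tau, tau, cond)  # map back to a clean pose
    return x


rng = np.random.default_rng(0)
# Hypothetical condition vector; the first five entries act as the target pose.
cond = np.array([0.4, 0.6, 0.2, 0.1, 0.3, 0.9, 0.9])
pose = sample_grasp(cond, taus=[20.0, 5.0], rng=rng)
print(pose)
```

With an empty `taus` list the sampler reduces to a single network evaluation, which is the property that makes consistency models attractive when latency matters; each extra entry in `taus` costs one more evaluation.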

Paper Structure

This paper contains 15 sections, 18 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of our method. First, the input RGB image and text prompt are fed into the feature encoder and ALBEF fusion [li2021align]. We then concurrently train two models with the same architecture: a score network that estimates the probability flow Ordinary Differential Equation (ODE) trajectory [song2020score] of the diffusion process, and a conditional consistency model that determines the grasp pose in a few denoising steps.
  • Figure 2: Consistency model analysis. Given the text prompt "Grasp the cup at its handle", we compare the grasp pose trajectory of our method with that of LGD [vuong2024language]. The top row shows the trajectory of LGD, and the bottom row shows the trajectory of our LLGD.
  • Figure 3: Visualization of detection results of different language-driven grasp detection methods.
  • Figure 4: In-the-wild detection results. Images are from the internet.
  • Figure 5: Prediction failure cases.

Theorems & Definitions (1)

  • proof