VCoT-Grasp: Grasp Foundation Models with Visual Chain-of-Thought Reasoning for Language-driven Grasp Generation

Haoran Zhang, Shuanghao Bai, Wanqi Zhou, Yuedi Zhang, Qi Zhang, Pengxiang Ding, Cheng Chi, Donglin Wang, Badong Chen

TL;DR

VCoT-Grasp introduces a visual chain-of-thought framework for language-driven grasp generation: the model reasons end to end, first localizing the target object with an intermediate bounding box and then predicting the grasp. Built on a PaliGemma-based VLM, the model supports multiple action-head designs and uses multi-turn in-context reasoning to improve fine-grained visual understanding in cluttered scenes. A refined dataset, VCoT-GraspSet, combines 167K synthetic images (1.36M grasps) with 400+ real-world images (1.2K grasps) and includes intermediate bounding boxes as chain-of-thought context. Experiments show superior in-distribution performance and robust generalization to unseen objects, distractors, and background changes, including strong zero-shot transfer to real robots.

Abstract

Robotic grasping is one of the most fundamental tasks in robotic manipulation, and grasp detection/generation has long been the subject of extensive research. Recently, language-driven grasp generation has emerged as a promising direction due to its practical interaction capabilities. However, most existing approaches either lack sufficient reasoning and generalization capabilities or depend on complex modular pipelines. Moreover, current grasp foundation models tend to overemphasize dialog and object semantics, which results in inferior performance and restricts them to single-object grasping. To maintain strong reasoning ability and generalization in cluttered environments, we propose VCoT-Grasp, an end-to-end grasp foundation model that incorporates visual chain-of-thought reasoning to enhance visual understanding for grasp generation. VCoT-Grasp adopts a multi-turn processing paradigm that dynamically focuses on visual inputs while providing interpretable reasoning traces. For training, we refine and introduce a large-scale dataset, VCoT-GraspSet, comprising 167K synthetic images with over 1.36M grasps, as well as 400+ real-world images with more than 1.2K grasps, annotated with intermediate bounding boxes. Extensive experiments on both VCoT-GraspSet and a real robot demonstrate that our method significantly improves grasp success rates and generalizes effectively to unseen objects, backgrounds, and distractors. More details can be found at https://zhanghr2001.github.io/VCoT-Grasp.github.io.
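
The multi-turn paradigm described above (localize the target first, then predict the grasp from a zoomed-in view) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `crop`, `two_turn_grasp`, `locate`, and `predict_grasp` are hypothetical, the image is a toy nested list, and the two callables stand in for the two turns of a real model.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box (the 'zoom-in' step)."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def two_turn_grasp(
    image: Image,
    instruction: str,
    locate: Callable[[Image, str], Box],
    predict_grasp: Callable[[Image, Image, str], object],
) -> Tuple[Box, object]:
    """Hypothetical two-turn flow: bounding box first, grasp second."""
    # Turn 1: ask the model where the target object is.
    box = locate(image, instruction)
    # Intermediate visual step: crop the image to the predicted box.
    zoomed = crop(image, box)
    # Turn 2: predict the grasp with both the full image and the crop.
    grasp = predict_grasp(image, zoomed, instruction)
    return box, grasp


# Stub "model" turns, used only to exercise the control flow.
def fake_locate(image: Image, instruction: str) -> Box:
    return (1, 1, 4, 3)


def fake_predict_grasp(image: Image, zoomed: Image, instruction: str) -> object:
    # Report the crop size (width, height) instead of a real grasp pose.
    return (len(zoomed[0]), len(zoomed))


toy_image = [[r * 10 + c for c in range(6)] for r in range(4)]
box, grasp = two_turn_grasp(toy_image, "grasp the mug", fake_locate, fake_predict_grasp)
```

The intermediate box is returned alongside the grasp, which mirrors the idea of an interpretable reasoning trace: a caller can inspect where the model looked before it chose a grasp.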

Paper Structure

This paper contains 15 sections, 5 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Diverging from prior language-driven grasp detection/generation approaches, including (a) end-to-end multimodal feature fusion methods [xu2023joint, nguyen2024language], (b) LLM/VLM-guided modular pipelines [tang2023graspgpt], and (c) end-to-end foundation models with language reasoning [xu2024rt], our method (d) advocates visual chain-of-thought reasoning, encouraging the model to "think with images." It emphasizes visual grounding by localizing regions that contain critical visual cues and dynamically zooming in to capture context at the appropriate granularity. This mechanism leads to superior generalization to unseen objects, backgrounds, and distractors.
  • Figure 2: Overall framework of VCoT-Grasp. Our grasp model architecture is built on the PaliGemma-3B VLM [steiner2024paligemma], which takes as input projected visual embeddings and tokenized task instructions, and employs multi-turn learning to predict both the location tokens of the target object and the grasp pose tokens. We evaluate multiple action head designs, in which the decoders exploit fine-grained visual information to generate grasp poses in either an autoregressive or a regression manner.
  • Figure 3: Dataset Statistics. We report the number of (a) seen and (b) unseen objects. In the seen scenario, Others denotes the aggregated count of the remaining 347 categories.
  • Figure 4: Real-world experimental setup and objects used in our evaluation.
  • Figure 5: (a) Grasp prediction performance as a function of the number of training samples. (b) Effect of training epochs, where overfitting emerges after the fifth epoch.
  • ...and 2 more figures
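
The Figure 2 caption mentions location tokens for the target object. As a rough illustration, PaliGemma's published detection output encodes a box as four `<locNNNN>` tokens in `y_min, x_min, y_max, x_max` order on a 1024-bin grid. The sketch below decodes that format into pixel coordinates. Whether VCoT-Grasp uses exactly this encoding for its box or grasp tokens is an assumption, and the helper names `parse_loc_tokens` and `loc_to_pixel_box` are hypothetical.

```python
import re
from typing import List, Tuple

LOC_PATTERN = re.compile(r"<loc(\d{4})>")  # matches tokens such as <loc0256>
NUM_BINS = 1024                            # PaliGemma quantizes coordinates into 1024 bins


def parse_loc_tokens(text: str) -> List[int]:
    """Extract the integer bin index from every <locNNNN> token in text."""
    return [int(digits) for digits in LOC_PATTERN.findall(text)]


def loc_to_pixel_box(
    locs: List[int], width: int, height: int
) -> Tuple[float, float, float, float]:
    """Convert four bin indices (y_min, x_min, y_max, x_max) to a pixel box.

    Returns (x_min, y_min, x_max, y_max) in pixel units.
    """
    if len(locs) != 4:
        raise ValueError(f"expected 4 location tokens, got {len(locs)}")
    y_min, x_min, y_max, x_max = locs
    return (
        x_min / NUM_BINS * width,
        y_min / NUM_BINS * height,
        x_max / NUM_BINS * width,
        y_max / NUM_BINS * height,
    )


# Example: a made-up model output for a 640x480 image.
output = "<loc0256><loc0128><loc0768><loc0512> mug"
locs = parse_loc_tokens(output)
pixel_box = loc_to_pixel_box(locs, width=640, height=480)
```

A box decoded this way could then drive the zoom-in crop that feeds the second reasoning turn.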