VCoT-Grasp: Grasp Foundation Models with Visual Chain-of-Thought Reasoning for Language-driven Grasp Generation
Haoran Zhang, Shuanghao Bai, Wanqi Zhou, Yuedi Zhang, Qi Zhang, Pengxiang Ding, Cheng Chi, Donglin Wang, Badong Chen
TL;DR
VCoT-Grasp introduces a visual chain-of-thought framework for language-driven grasp generation: the model reasons end to end, first localizing the target object with an intermediate bounding box and then predicting the grasp. Built on a PaliGemma-based VLM, it supports multiple action-head designs and uses multi-turn in-context reasoning to improve fine-grained visual understanding in cluttered scenes. A refined dataset, VCoT-GraspSet, combines 167K synthetic images (1.36M grasps) with 400+ real-world images (1.2K grasps) and includes intermediate bounding boxes as chain-of-thought context. Experiments show superior in-distribution performance and robust generalization to unseen objects, distractors, and background changes, including strong zero-shot transfer to real robots, which suggests the approach is practical for language-guided robotic manipulation.
Abstract
Robotic grasping is one of the most fundamental tasks in robotic manipulation, and grasp detection and generation have long been the subject of extensive research. Recently, language-driven grasp generation has emerged as a promising direction due to its practical interaction capabilities. However, most existing approaches either lack sufficient reasoning and generalization capabilities or depend on complex modular pipelines. Moreover, current grasp foundation models tend to overemphasize dialog and object semantics, which results in inferior performance and restricts them to single-object grasping. To maintain strong reasoning ability and generalization in cluttered environments, we propose VCoT-Grasp, an end-to-end grasp foundation model that incorporates visual chain-of-thought reasoning to enhance visual understanding for grasp generation. VCoT-Grasp adopts a multi-turn processing paradigm that dynamically focuses on visual inputs while providing interpretable reasoning traces. For training, we refine and introduce a large-scale dataset, VCoT-GraspSet, comprising 167K synthetic images with over 1.36M grasps, as well as 400+ real-world images with more than 1.2K grasps, annotated with intermediate bounding boxes. Extensive experiments on both VCoT-GraspSet and a real robot demonstrate that our method significantly improves grasp success rates and generalizes effectively to unseen objects, backgrounds, and distractors. More details can be found at https://zhanghr2001.github.io/VCoT-Grasp.github.io.
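The multi-turn paradigm described above (localize the target with an intermediate bounding box, then predict the grasp) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Box`, `Grasp`, `vcot_grasp`, `toy_locate`, and `toy_predict_grasp` are hypothetical, the five-parameter rectangle grasp is an assumed representation, and feeding a crop of the boxed region into the second turn is an assumption about how the model "focuses" on visual input. The two toy predictors stand in for the VLM and its action head.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


@dataclass
class Box:
    # Pixel corners of the intermediate bounding box; x1 and y1 are exclusive.
    x0: int
    y0: int
    x1: int
    y1: int


@dataclass
class Grasp:
    # Assumed rectangle-style grasp: center, size, and rotation angle.
    cx: float
    cy: float
    width: float
    height: float
    angle: float


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    return [row[box.x0:box.x1] for row in image[box.y0:box.y1]]


def vcot_grasp(
    image: Image,
    instruction: str,
    locate: Callable[[Image, str], Box],
    predict_grasp: Callable[[Image, Image, str], Grasp],
) -> Tuple[Box, Grasp]:
    """Two-turn sketch: bounding box first, then a grasp conditioned on it."""
    # Turn 1: localize the object named in the instruction.
    box = locate(image, instruction)
    # Turn 2: predict a grasp from the full image plus the focused region.
    region = crop(image, box)
    local = predict_grasp(image, region, instruction)
    # The toy predictor works in crop coordinates, so shift back to the image.
    grasp = Grasp(local.cx + box.x0, local.cy + box.y0,
                  local.width, local.height, local.angle)
    return box, grasp


def toy_locate(image: Image, instruction: str) -> Box:
    """Stand-in for the VLM: box around all non-zero pixels."""
    ys = [y for y, row in enumerate(image) if any(row)]
    xs = [x for row in image for x, v in enumerate(row) if v]
    return Box(min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def toy_predict_grasp(image: Image, region: Image, instruction: str) -> Grasp:
    """Stand-in for the action head: grasp centered on the cropped region."""
    h, w = len(region), len(region[0])
    return Grasp(w / 2, h / 2, float(w), float(h), 0.0)


# A 6x8 scene with one "object" occupying rows 2-3 and columns 3-5.
scene = [[0] * 8 for _ in range(6)]
for y in (2, 3):
    for x in (3, 4, 5):
        scene[y][x] = 1

box, grasp = vcot_grasp(scene, "grasp the block", toy_locate, toy_predict_grasp)
```

The returned `box` serves as the interpretable reasoning trace, and the grasp is expressed in full-image coordinates. Passing the predictors in as callables keeps the sketch independent of any particular model or action-head design.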
