ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter
Yaoyao Qian, Xupeng Zhu, Ondrej Biza, Shuo Jiang, Linfeng Zhao, Haojie Huang, Yu Qi, Robert Platt
TL;DR
This work tackles robotic grasping in cluttered environments by combining a vision-language framework with goal-directed reasoning. ThinkGrasp uses GPT-4o to imagine segmentation targets from natural language instructions, a 3×3 grid to select robust grasp regions, and LangSAM/VLPart for precise segmentation, all within a closed loop that updates after each grasp. The approach achieves state-of-the-art performance in heavy clutter and on unseen objects, in both simulated and real settings, and ablations confirm the contribution of each component. The system is modular and generalizes well, making it practical for reliable grasping in complex environments. The authors acknowledge current limitations, including single-view reconstruction and a restriction to grasp-only tasks. These insights support scalable, language-conditioned manipulation in real-world robotics.
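To make the 3×3 grid idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the grid is laid over a target's 2D bounding box, that cells are numbered 1 to 9 in row-major order, and that grasp candidates are given as pixel coordinates; the names `grid_cell` and `grasps_in_cell` are hypothetical.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels; the box format is an assumption.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def grid_cell(box: Box, cell: int) -> Box:
    """Return the sub-box of `box` for a cell numbered 1..9.

    Cells are assumed to be numbered row-major: 1 is top-left,
    5 is the center, 9 is bottom-right.
    """
    if not 1 <= cell <= 9:
        raise ValueError("cell must be in 1..9")
    x0, y0, x1, y1 = box
    row, col = divmod(cell - 1, 3)
    w = (x1 - x0) / 3.0  # width of one grid column
    h = (y1 - y0) / 3.0  # height of one grid row
    return (x0 + col * w, y0 + row * h, x0 + (col + 1) * w, y0 + (row + 1) * h)


def grasps_in_cell(grasps: List[Point], box: Box, cell: int) -> List[Point]:
    """Keep only grasp candidates whose pixel location lies in the chosen cell."""
    cx0, cy0, cx1, cy1 = grid_cell(box, cell)
    return [(u, v) for (u, v) in grasps if cx0 <= u <= cx1 and cy0 <= v <= cy1]


if __name__ == "__main__":
    target_box = (0.0, 0.0, 90.0, 90.0)
    candidates = [(10.0, 10.0), (45.0, 45.0), (80.0, 80.0)]
    # Suppose the language model picked the center cell (5) as the preferred region.
    print(grasps_in_cell(candidates, target_box, 5))
```

In this sketch the language model would only need to return a single integer, which keeps the interface between the reasoning step and the grasp-pose filter small.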
Abstract
Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We present ThinkGrasp, a plug-and-play vision-language grasping system that uses GPT-4o's contextual reasoning to plan grasping strategies in heavily cluttered environments. ThinkGrasp can identify and generate grasp poses for target objects, even when they are heavily obstructed or nearly invisible, by using goal-oriented language to guide the removal of obstructing objects. This approach progressively uncovers the target object and ultimately grasps it in a few steps with a high success rate. In both simulated and real experiments, ThinkGrasp achieved a high success rate and significantly outperformed state-of-the-art methods in heavily cluttered environments and with diverse unseen objects, demonstrating strong generalization capabilities.
