ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter

Yaoyao Qian, Xupeng Zhu, Ondrej Biza, Shuo Jiang, Linfeng Zhao, Haojie Huang, Yu Qi, Robert Platt

TL;DR

This work tackles robotic grasping in cluttered environments by combining a vision-language framework with goal-directed reasoning. ThinkGrasp uses GPT-4o to infer segmentation targets from natural-language instructions, a 3×3 grid to select robust grasp regions, and LangSAM/VLPart for precise segmentation, all within a closed loop that updates after each grasp. The approach delivers state-of-the-art performance in heavy clutter and on unseen objects, in both simulated and real settings, and ablations confirm the contribution of each component. The system generalizes well and is modular, which makes it practical for reliable grasping in complex environments; its current limitations include single-view reconstruction and support for grasp-only tasks. These results point toward scalable, language-conditioned manipulation in real-world robotics.
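As a rough illustration of the 3×3 grid idea mentioned above, the sketch below lays a grid over a target's bounding box and picks the candidate grasp point nearest the centre of a chosen cell. This is a minimal sketch under stated assumptions, not the authors' implementation: the row-major cell numbering (1 to 9), the pixel-space bounding box, the nearest-to-centre rule, and the names `grid_cell_bounds` and `nearest_grasp` are all hypothetical, and the cell index is assumed to come from an upstream step.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Point = Tuple[float, float]              # (u, v) pixel coordinates


def grid_cell_bounds(box: Box, cell: int) -> Box:
    """Return the bounds of one cell of a 3x3 grid laid over `box`.

    Cells are numbered 1..9 in row-major order (1 = top-left,
    9 = bottom-right). This numbering is an assumption of the sketch.
    """
    if not 1 <= cell <= 9:
        raise ValueError("cell must be in 1..9")
    x_min, y_min, x_max, y_max = box
    row, col = divmod(cell - 1, 3)
    w = (x_max - x_min) / 3.0  # width of one cell
    h = (y_max - y_min) / 3.0  # height of one cell
    return (x_min + col * w, y_min + row * h,
            x_min + (col + 1) * w, y_min + (row + 1) * h)


def nearest_grasp(candidates: Sequence[Point], box: Box, cell: int) -> Point:
    """Pick the candidate grasp point closest to the centre of the chosen cell."""
    x0, y0, x1, y1 = grid_cell_bounds(box, cell)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Squared Euclidean distance is enough for ranking candidates.
    return min(candidates, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
```

For example, with a 90×90 pixel box and cell 5 (the centre cell), `nearest_grasp([(10, 10), (44, 47), (80, 80)], (0, 0, 90, 90), 5)` selects the candidate nearest the middle of the box. A real system would rank full 6-DoF grasp poses rather than 2D points; the 2D version only shows the region-selection step.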

Abstract

Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We have developed ThinkGrasp, a plug-and-play vision-language grasping system that uses GPT-4o's advanced contextual reasoning to plan grasping strategies in heavily cluttered environments. ThinkGrasp can effectively identify and generate grasp poses for target objects, even when they are heavily obstructed or nearly invisible, by using goal-oriented language to guide the removal of obstructing objects. This approach progressively uncovers the target object and ultimately grasps it in a few steps and with a high success rate. In both simulated and real experiments, ThinkGrasp achieved a high success rate and significantly outperformed state-of-the-art methods in heavily cluttered environments or with diverse unseen objects, demonstrating strong generalization capabilities.

Paper Structure

This paper contains 23 sections, 4 equations, 5 figures, 10 tables, 1 algorithm.

Figures (5)

  • Figure 1: ThinkGrasp pipeline for cluttered environments
  • Figure 2: Closed-loop grasping process demonstrating
  • Figure 3: Clutter cases in simulation. The target objects are labeled with stars.
  • Figure 4: Heavy Clutter cases in simulation. The target objects are labeled with stars.
  • Figure 5: Real Robot Task