ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter

Yaoyao Qian; Xupeng Zhu; Ondrej Biza; Shuo Jiang; Linfeng Zhao; Haojie Huang; Yu Qi; Robert Platt

TL;DR

This work tackles robotic grasping in cluttered environments by combining a vision-language framework with goal-directed reasoning. ThinkGrasp uses GPT-4o to propose segmentation targets from natural-language instructions, a 3×3 grid to select robust grasp regions, and LangSAM/VLPart for precise segmentation, all within a closed loop that updates after each grasp. The approach reports state-of-the-art performance in heavy clutter and on unseen objects, in both simulated and real settings, and ablations examine the contribution of each component. The system is modular and generalizes well, but it currently relies on single-view reconstruction and handles grasping tasks only. These results support scalable, language-conditioned manipulation in real-world robotics.
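To make the 3×3 grid idea concrete, here is a minimal sketch of how a chosen grid cell could be mapped back to a pixel region inside an object's bounding box. It is illustrative only: the function name `grid_cell_bounds`, the `(x_min, y_min, x_max, y_max)` box format, and the row-major 1 to 9 cell numbering are assumptions made for this example and are not taken from the paper.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


def grid_cell_bounds(box: Box, cell: int) -> Box:
    """Return the pixel bounds of one cell in a 3x3 grid laid over `box`.

    Assumption: cells are numbered 1..9 in row-major order
    (1 = top-left, 5 = center, 9 = bottom-right).
    """
    if not 1 <= cell <= 9:
        raise ValueError("cell must be in 1..9")
    x_min, y_min, x_max, y_max = box
    if x_max <= x_min or y_max <= y_min:
        raise ValueError("box must have positive width and height")

    row, col = divmod(cell - 1, 3)
    width = x_max - x_min
    height = y_max - y_min

    # Integer edges chosen so the nine cells tile the box without gaps or overlap.
    left = x_min + (col * width) // 3
    right = x_min + ((col + 1) * width) // 3
    top = y_min + (row * height) // 3
    bottom = y_min + ((row + 1) * height) // 3
    return (left, top, right, bottom)
```

Under these assumptions, a cell index returned by the language model (for example, 5 for the center of the object) could be turned into a crop that restricts which grasp candidates are considered.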

Abstract

Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We have developed ThinkGrasp, a plug-and-play vision-language grasping system that uses GPT-4o's advanced contextual reasoning to plan grasping strategies in heavily cluttered environments. ThinkGrasp can effectively identify and generate grasp poses for target objects, even when they are heavily obstructed or nearly invisible, by using goal-oriented language to guide the removal of obstructing objects. This approach progressively uncovers the target object and ultimately grasps it in a few steps and with a high success rate. In both simulated and real experiments, ThinkGrasp achieved a high success rate and significantly outperformed state-of-the-art methods in heavily cluttered environments and with diverse unseen objects, demonstrating strong generalization capabilities.

Paper Structure (23 sections, 4 equations, 5 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Robotic Grasping in Cluttered Environments
Pre-trained Models for Robotic Grasping
Method
Problem Definition
System Pipeline
GPT-4o's Role and Constraint Solver in Target Object Selection
3×3 Grid Strategy for Optimal Grasp Part Selection
Target Object Segmentation and Cropping Region Generation
Grasp Pose Generation and Selection
Closed-Loop System for Robustness in Heavy Clutter
Experiments
Simulation
Results
...and 8 more sections

Figures (5)

Figure 1: ThinkGrasp pipeline for cluttered environments
Figure 2: Closed-loop grasping process demonstrating
Figure 3: Clutter cases in simulation. The target objects are labeled with stars.
Figure 4: Heavy Clutter cases in simulation. The target objects are labeled with stars.
Figure 5: Real Robot Task
