GraspCorrect: Robotic Grasp Correction via Vision-Language Model-Guided Feedback

Sungjae Lee; Yeonjoo Hong; Kwang In Kim

TL;DR

GraspCorrect tackles unstable robotic grasping with a plug-and-play module that uses vision-language models (VLMs) to guide grasp detection, generates a realistic visual goal, and refines actions with a diffusion-based controller. The method decomposes the task into three stages: VLM-guided grasp detection with grasp-guided prompts and object-aware sampling, visual goal generation via image composition, and goal-conditioned behavioral cloning for joint actions. It yields architecture-agnostic improvements on the RLBench and CALVIN benchmarks. Ablation studies confirm the value of grasp-guided prompts and object-aware sampling, with substantial gains in grasp reliability and pose accuracy. The results suggest that GraspCorrect can meaningfully enhance the robustness of diverse manipulation policies in real-world settings without extensive retraining, enabling more reliable long-horizon robotic tasks.

Abstract

Despite significant advancements in robotic manipulation, achieving consistent and stable grasping remains a fundamental challenge, often limiting the successful execution of complex tasks. Our analysis reveals that even state-of-the-art policy models frequently exhibit unstable grasping behaviors, leading to failure cases that create bottlenecks in real-world robotic applications. To address these challenges, we introduce GraspCorrect, a plug-and-play module designed to enhance grasp performance through vision-language model-guided feedback. GraspCorrect employs an iterative visual question-answering framework with two key components: grasp-guided prompting, which incorporates task-specific constraints, and object-aware sampling, which ensures the selection of physically feasible grasp candidates. By iteratively generating intermediate visual goals and translating them into joint-level actions, GraspCorrect significantly improves grasp stability and consistently enhances task success rates across existing policy models in the RLBench and CALVIN datasets.
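The interplay between object-aware sampling and iterative querying can be illustrated with a small sketch. This is not the authors' implementation: the names `sample_on_object`, `refine_grasp`, and `choose` are hypothetical, the object is assumed to be given as a set of mask pixels, and the VLM query is replaced by a plain callable that picks one candidate from a list. The shrinking search radius is likewise an assumption made for illustration.

```python
import random
from typing import Callable, List, Set, Tuple

Point = Tuple[int, int]  # (row, col) pixel coordinate


def sample_on_object(mask: Set[Point], center: Point, radius: int,
                     k: int, rng: random.Random) -> List[Point]:
    """Draw up to k candidates that lie on the object mask and within
    `radius` (Chebyshev distance) of `center`."""
    nearby = sorted(
        p for p in mask
        if max(abs(p[0] - center[0]), abs(p[1] - center[1])) <= radius
    )
    return rng.sample(nearby, min(k, len(nearby)))


def refine_grasp(mask: Set[Point], start: Point,
                 choose: Callable[[List[Point]], Point],
                 rounds: int = 3, radius: int = 8, k: int = 5,
                 seed: int = 0) -> Point:
    """Iteratively ask `choose` (a stand-in for the VLM query) to pick the
    best on-object candidate, then resample around that pick."""
    rng = random.Random(seed)
    best = start
    for _ in range(rounds):
        candidates = sample_on_object(mask, best, radius, k, rng)
        if not candidates:
            break  # nothing on the object within reach; keep the current point
        # Keep the current point in the running if it is already on the object.
        if best in mask and best not in candidates:
            candidates.append(best)
        best = choose(candidates)
        radius = max(1, radius // 2)  # assumed schedule: narrow the search
    return best


# Toy usage: a 10x10 square object and a stub that prefers its center.
mask = {(r, c) for r in range(10, 20) for c in range(10, 20)}
target = (15, 15)


def dist(p: Point) -> int:
    return abs(p[0] - target[0]) + abs(p[1] - target[1])


def choose(candidates: List[Point]) -> Point:
    return min(candidates, key=dist)


start = (5, 5)  # initial contact point that misses the object
result = refine_grasp(mask, start, choose)
```

In this toy setup the stub simply minimizes distance to a fixed target; in the setting the abstract describes, that choice would instead come from a visual question-answering query conditioned on the task.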
