GraspCorrect: Robotic Grasp Correction via Vision-Language Model-Guided Feedback

Sungjae Lee, Yeonjoo Hong, Kwang In Kim

TL;DR

GraspCorrect tackles unstable robotic grasping with a plug-and-play module that uses vision-language models (VLMs) to guide grasp detection, generates a realistic visual goal, and refines actions with a diffusion-based controller. The method decomposes the task into three stages: VLM-guided grasp detection with grasp-guided prompts and object-aware sampling, visual goal generation via image composition, and goal-conditioned behavioral cloning that produces joint-level actions. The module is architecture-agnostic and improves existing policies on the RLBench and CALVIN benchmarks. Ablation studies confirm the value of grasp-guided prompts and object-aware sampling, with substantial gains in grasp reliability and pose accuracy. The results suggest GraspCorrect can meaningfully enhance the robustness of diverse manipulation policies in real-world settings without extensive retraining, enabling more reliable long-horizon robotic tasks.

Abstract

Despite significant advancements in robotic manipulation, achieving consistent and stable grasping remains a fundamental challenge, often limiting the successful execution of complex tasks. Our analysis reveals that even state-of-the-art policy models frequently exhibit unstable grasping behaviors, leading to failure cases that create bottlenecks in real-world robotic applications. To address these challenges, we introduce GraspCorrect, a plug-and-play module designed to enhance grasp performance through vision-language model-guided feedback. GraspCorrect employs an iterative visual question-answering framework with two key components: grasp-guided prompting, which incorporates task-specific constraints, and object-aware sampling, which ensures the selection of physically feasible grasp candidates. By iteratively generating intermediate visual goals and translating them into joint-level actions, GraspCorrect significantly improves grasp stability and consistently enhances task success rates across existing policy models in the RLBench and CALVIN datasets.
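
The iterative loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows the general idea of drawing grasp candidates exclusively from object pixels (so that every candidate touches the object) and narrowing the search around the candidates picked in each round. The names `sample_on_object`, `refine_grasp_point`, and `closest_to_target` are hypothetical, the segmentation mask is assumed to be given, and the selector callback is a stand-in for the VLM question-answering step, which in practice would be an actual model query with a task-specific prompt.

```python
import numpy as np


def sample_on_object(mask, center, sigma, n, rng):
    """Draw n (x, y) candidates near `center`, restricted to object pixels."""
    ys, xs = np.nonzero(mask)
    pixels = np.stack([xs, ys], axis=1).astype(float)
    d2 = np.sum((pixels - center) ** 2, axis=1)
    # Gaussian weighting around the current center; subtracting the minimum
    # keeps at least one weight equal to 1 and avoids an all-zero vector.
    weights = np.exp(-0.5 * (d2 - d2.min()) / sigma**2)
    weights /= weights.sum()
    idx = rng.choice(len(pixels), size=n, replace=True, p=weights)
    return pixels[idx]


def refine_grasp_point(mask, select_fn, n_candidates=10, n_keep=3,
                       n_iters=3, seed=0):
    """Iteratively sample on-object candidates and shrink around the picks.

    `select_fn(candidates, k)` is a placeholder for the VLM query: it must
    return the indices of the k preferred candidates.
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(mask)
    pixels = np.stack([xs, ys], axis=1).astype(float)
    center = pixels.mean(axis=0)                  # start at the object centroid
    extent = max(xs.max() - xs.min(), ys.max() - ys.min(), 1)
    sigma = extent / 2.0
    for _ in range(n_iters):
        candidates = sample_on_object(mask, center, sigma, n_candidates, rng)
        kept = candidates[select_fn(candidates, n_keep)]
        center = kept.mean(axis=0)                # re-center on selected points
        sigma = max(sigma / 2.0, 1.0)             # narrow the search region
    # Snap the final estimate to the nearest object pixel so that the
    # returned grasp point always lies on the object.
    nearest = np.argmin(np.sum((pixels - center) ** 2, axis=1))
    return pixels[nearest]


def closest_to_target(candidates, k, target=np.array([45.0, 22.0])):
    """Toy selector (assumption): prefers candidates near a fixed target."""
    dist = np.linalg.norm(candidates - target, axis=1)
    return np.argsort(dist)[:k]


# Toy scene: a rectangular object in a 64x64 image.
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 10:50] = True
grasp = refine_grasp_point(mask, closest_to_target)
```

Because candidates are drawn only from masked pixels and the final estimate is snapped back onto the mask, the returned point cannot miss the object, which is the property the object-aware sampling component is described as providing. How the actual method builds its prompts, sampling distribution, and stopping rule is not specified in this section.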

Paper Structure

This paper contains 28 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Importance of precise gripper action. Left: Visualization of successful and failed cases in the RLBench insert peg task. Right: Performance improvement on challenging RLBench tasks. By replicating demonstrated stable grasp poses up to the grasping point, we observe substantial improvements in task success rates (%). This preliminary result highlights the significant impact of robust grasping on the overall performance of end-to-end robotic policy models.
  • Figure 2: Overview of the GraspCorrect process. This module enhances robotic manipulation by establishing a stable grasp as a critical milestone. In the Grasp Detection stage, task-specific VLM guidance predicts the desired gripper positioning through an iterative question-answering process. The Visual Goal Generation stage then synthesizes a goal-state image via image composition, representing the ideal grasp configuration. Finally, the Action Generation stage predicts and executes corrective actions, improving grasping reliability.
  • Figure 3: Visualization of iterative grasp point refinement using PIVOT [NXY24] (top) and our method (bottom). The circles represent grasp candidates sampled by each algorithm, with red circles indicating those selected for the next sampling stage. Due to its lack of target-specific contextualization, PIVOT often predicts grasp points that fail to make contact with the object. In contrast, our method ensures all selected grasp locations are physically viable. The left and right gripper positions are aligned within the camera's image plane, making it sufficient to generate grasp points near the image boundaries (see the failure-case figure, left).
  • Figure 4: Evaluation of diffusion-based models for generating goal-state images in robotic manipulation tasks. The input image (top-left) shows the initial gripper configuration approaching a blue square object, while the expected output (bottom-left) represents the ground-truth stable grasp pose from a successful RLBench insert peg demonstration. Existing models struggle to accurately capture the required details, spatial arrangements, and contextual relevance essential for precise robotic grasping.
  • Figure 5: Data generation process for policy training. Left: waypoints are randomly varied to introduce realistic grasping variations. Middle: the resulting grasp pose with variation is saved. Right: the grasp pose without waypoint variation represents the stable grasp.
  • ...and 2 more figures