Obstruction reasoning for robotic grasping

Runyu Jiao, Matteo Bortolon, Francesco Giuliari, Alice Fasoli, Sergio Povoli, Guofeng Mei, Yiming Wang, Fabio Poiesi

TL;DR

Obstruction reasoning is essential for robust robotic grasping in clutter. The authors propose UNOGrasp, a vision-language framework that grounds a target object and reasons through obstruction paths via a target-centric graph, trained with supervised fine-tuning and reinforcement fine-tuning using obstruction-aware rewards. They also introduce UNOBench, a large-scale dataset with over 100k obstruction paths for training and benchmarking obstruction reasoning in both synthetic and real scenes. Across synthetic and real-world tests, UNOGrasp achieves state-of-the-art obstruction reasoning and higher grasp success, with ablations confirming the value of obstruction cues and IoU-based rewards for learning robust, grounded multi-step plans.

Abstract

Successful robotic grasping in cluttered environments requires a model not only to visually ground a target object but also to reason about obstructions that must be cleared beforehand. While current vision-language embodied reasoning models show emergent spatial understanding, they remain limited in terms of obstruction reasoning and accessibility planning. To bridge this gap, we present UNOGrasp, a learning-based vision-language model capable of performing visually-grounded obstruction reasoning to infer the sequence of actions needed to unobstruct the path and grasp the target object. We devise a novel multi-step reasoning process based on obstruction paths originating from the target object. We anchor each reasoning step with obstruction-aware visual cues to incentivize reasoning capability. UNOGrasp combines supervised and reinforcement finetuning through verifiable reasoning rewards. Moreover, we construct UNOBench, a large-scale dataset for both training and benchmarking, based on MetaGraspNetV2, with over 100k obstruction paths annotated by humans with obstruction ratios, contact points, and natural-language instructions. Extensive experiments and real-robot evaluations show that UNOGrasp significantly improves obstruction reasoning and grasp success across both synthetic and real-world environments, outperforming generalist and proprietary alternatives. Project website: https://tev-fbk.github.io/UnoGrasp/.

Paper Structure

This paper contains 26 sections, 10 equations, 20 figures, 11 tables, 1 algorithm.

Figures (20)

  • Figure 1: UNOGrasp performs multi-step obstruction reasoning for robotic grasping in cluttered scenes. Given an RGB-D image and a natural-language goal (e.g., grasp the white iphone box), UNOGrasp reasons and grounds spatial information to infer the sequence of steps to unobstruct a requested object. We also introduce UNOBench to comprehensively benchmark obstruction reasoning.
  • Figure 2: UNOBench features two unique characteristics: (i) human-annotated free-form language instructions about objects in cluttered bins, and (ii) per-bin obstruction graphs for grounded spatial reasoning. Human annotators recruited through the Prolific platform refined the initial GPT-4o-generated annotations. UNOBench features three levels of difficulty and introduces novel evaluation metrics.
  • Figure 3: UNOGrasp is a VLM trained through supervised finetuning (SFT) on UNOBench to learn structured obstruction-path reasoning, and through GRPO-based reinforcement finetuning (RFT) to further boost its reasoning ability using outcome-driven IoU and format rewards. During inference, given an RGB image and a target object as language instruction, UNOGrasp reasons over multiple obstruction paths ($\texttt{<think>}$ traces) and directly outputs the sequence of actions ($\texttt{<answer>}$) required to remove obstructions and grasp the target.
  • Figure 4: Qualitative results on different UNOBench splits and for two types of failure. Markers indicate the target object, the top obstructor, and the predictions of UNOGrasp, Gemini Robotics-ER 1.5, and Qwen2.5-VL (ICL) with their reasoning traces. $(\text{SR-F1} / \text{MP\_NED})$ scores are reported at the bottom of each image.
  • Figure 5: Qualitative results from laboratory robotics experiments. Markers indicate the target object and the predictions of UNOGrasp, Gemini Robotics-ER 1.5, and Qwen2.5-VL (ICL). Labels are shown for misaligned predictions (disagreement between label and spatial location). The difficulty level and target prompt are displayed at the top of the figure.
  • ...and 15 more figures
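
The Figure 3 caption mentions outcome-driven IoU and format rewards over $\texttt{<think>}$ and $\texttt{<answer>}$ outputs. As a rough illustration of what such verifiable rewards might look like, here is a minimal Python sketch. It is not the authors' implementation: the function names `format_reward` and `box_iou` are hypothetical, and the page does not say what the IoU is computed over, so the sketch assumes axis-aligned bounding boxes given as `(x1, y1, x2, y2)`.

```python
import re

# Assumed response layout: a <think> block followed by an <answer> block.
# The exact template used by UNOGrasp is not given in the source.
_FORMAT_RE = re.compile(
    r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL
)


def format_reward(response: str) -> float:
    """Return 1.0 if the response is a <think> block followed by an <answer> block, else 0.0."""
    return 1.0 if _FORMAT_RE.fullmatch(response) else 0.0


def box_iou(a, b) -> float:
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0
```

In a reinforcement finetuning loop, the two values could be combined (for example, summed) into a single scalar per sampled response. How UNOGrasp actually weights or combines them is not stated here.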