Viewpoint-Agnostic Grasp Pipeline using VLM and Partial Observations

Dilermando Almeida, Juliano Negri, Guilherme Lazzarini, Thiago H. Segreto, Ranulfo Bezerra, Ricardo V. Godoy, Marcelo Becker

TL;DR

This paper presents an end-to-end pipeline for language-guided grasping that connects open-vocabulary target selection to safe grasp execution on a real robot. In paired trials in clutter, it succeeds in 9/10 grasps versus 3/10 for a view-dependent baseline, showing improved robustness to occlusions and partial observations.

Abstract

Robust grasping in cluttered, unstructured environments remains challenging for mobile legged manipulators due to occlusions that lead to partial observations, unreliable depth estimates, and the need for collision-free, execution-feasible approaches. In this paper we present an end-to-end pipeline for language-guided grasping that bridges open-vocabulary target selection to safe grasp execution on a real robot. Given a natural-language command, the system grounds the target in RGB using open-vocabulary detection and promptable instance segmentation, extracts an object-centric point cloud from RGB-D, and improves geometric reliability under occlusion via back-projected depth compensation and two-stage point cloud completion. We then generate and collision-filter 6-DoF grasp candidates and select an executable grasp using safety-oriented heuristics that account for reachability, approach feasibility, and clearance. We evaluate the method on a quadruped robot with an arm in two cluttered tabletop scenarios, using paired trials against a view-dependent baseline. The proposed approach achieves a 90% overall success rate (9/10) against 30% (3/10) for the baseline, demonstrating substantially improved robustness to occlusions and partial observations in clutter.
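
The step of extracting an object-centric point cloud from RGB-D can be illustrated with a minimal sketch of masked depth back-projection under a pinhole camera model. This is not the authors' implementation (the paper's Figure 2 attributes this stage to Isaac ROS Nvblox). The function name, the intrinsics `fx`, `fy`, `cx`, `cy`, and the depth limits are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_masked_depth(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    min_depth: float = 0.1,  # hypothetical validity range, in metres
    max_depth: float = 3.0,
) -> List[Point]:
    """Back-project masked depth pixels into camera-frame 3D points.

    depth[v][u] is the metric depth at pixel column u, row v.
    mask[v][u] is True where the pixel belongs to the target instance.
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if not keep:
                continue
            # Drops zero, out-of-range, and NaN depths (NaN fails both comparisons).
            if not (min_depth <= z <= max_depth):
                continue
            # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


if __name__ == "__main__":
    toy_depth = [[0.0, 1.0], [2.0, 5.0]]
    toy_mask = [[True, True], [True, True]]
    print(backproject_masked_depth(toy_depth, toy_mask, 1.0, 1.0, 0.0, 0.0))
```

In the toy call, the pixel with zero depth and the pixel beyond `max_depth` are discarded, so only two points survive. A real pipeline would typically vectorize this step and transform the points into a common world or body frame.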

Paper Structure (34 sections, 5 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: A legged mobile manipulator performs language-guided object grasping in a cluttered, unstructured scene under partial observations. The proposed pipeline grounds a natural-language target in RGB using open-vocabulary detection and segmentation, estimates object-centric 3D geometry from RGB-D, including completion under occlusion, and selects and executes a reliable 6-DoF grasp on the real robot.
  • Figure 2: System overview of the proposed viewpoint-agnostic grasping pipeline. The system receives a natural-language target prompt (e.g., "blue bottle") together with synchronized RGB and depth observations. The prompt is grounded in RGB using Grounding DINO [liu2024grounding] to obtain a target bounding box and SAM 2 [ravisam] to produce an instance mask. The mask is then used to extract an object-centric partial point cloud from depth via Isaac ROS Nvblox [millane2024nvblox] using depth backprojection. To mitigate occlusions and sparse depth, the object geometry is completed in two stages: MGPC [liu2026mgpc] generates synthetic points conditioned on the prompt, RGB, and the partial point cloud, and PoinTr [yu2021pointr] further densifies the geometry by completing fixed-size local patches. Given the densified point cloud, Grasp Pose Generator (GPG) [gpg2016] samples antipodal 6-DoF grasp candidates, which are collision-filtered against nearby scene points and ranked to select an execution-feasible grasp. Finally, the robot executes a state-machine locomanipulation routine that (when needed) repositions the base for reachability and clearance, followed by end-effector approach, grasp closure, and object lift.
  • Figure 3: Example registered stereo and RGB images from Spot's front-left camera, captured while the robot was stationary. The images illustrate the noise and limited resolution of the available sensors.
  • Figure 4: Experimental setups for evaluating the viewpoint-agnostic grasp pipeline. The environments consist of cluttered industrial and household objects. In Setup A (left), the pipeline must identify and grasp a power drill partially obscured by boxes and electrical components. In Setup B (right), the pipeline must target a blue bottle situated behind different boxes. Both configurations test the pipeline's ability to generate grasps when the target is partially occluded by surrounding clutter.
  • Figure 5: Sequence demonstrating the grasp execution experiments using the proposed end-to-end pipeline on the real robot. (a) After language-guided target selection and instance segmentation, the system estimates object-centric 3D geometry from partial RGB-D observations (including completion), selects an execution-feasible 6-DoF grasp under collision and reachability constraints, and repositions the base to satisfy reachability and clearance for the planned approach. (b) The robot aligns to the target and commands the arm to a collision-free pre-grasp pose with a safety offset. (c) The end-effector executes a short Cartesian insertion along the grasp approach direction and closes the gripper to secure the object. (d) The object is lifted to confirm grasp success and stability under post-grasp interaction.
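
The collision filtering and pre-grasp steps described in the captions for Figures 2 and 5 can be sketched as follows. This is a simplified illustration, not the paper's method: it models the gripper's approach as a line segment from the pre-grasp position to the grasp position and rejects any candidate whose segment passes within a clearance radius of a scene point. The function names and the `offset` and `clearance` values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Grasp = Tuple[Vec3, Vec3]  # (grasp position, approach direction)


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def pre_grasp_position(grasp_pos: Vec3, approach: Vec3, offset: float) -> Vec3:
    """Back the end-effector off by `offset` metres against the approach direction."""
    norm = math.sqrt(_dot(approach, approach))
    if norm == 0.0:
        raise ValueError("approach direction must be non-zero")
    return tuple(p - offset * c / norm for p, c in zip(grasp_pos, approach))


def distance_to_segment(point: Vec3, start: Vec3, end: Vec3) -> float:
    """Euclidean distance from `point` to the segment start-end."""
    seg = _sub(end, start)
    seg_len_sq = _dot(seg, seg)
    if seg_len_sq == 0.0:
        return math.dist(point, start)
    # Project onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, _dot(_sub(point, start), seg) / seg_len_sq))
    closest = tuple(s + t * d for s, d in zip(start, seg))
    return math.dist(point, closest)


def filter_grasps(
    candidates: Sequence[Grasp],
    scene_points: Sequence[Vec3],
    offset: float = 0.10,     # hypothetical safety offset, metres
    clearance: float = 0.03,  # hypothetical clearance radius, metres
) -> List[Grasp]:
    """Keep candidates whose approach segment stays clear of all scene points.

    `scene_points` should exclude the target object's own points.
    """
    feasible: List[Grasp] = []
    for grasp_pos, approach in candidates:
        start = pre_grasp_position(grasp_pos, approach, offset)
        if all(distance_to_segment(p, start, grasp_pos) >= clearance
               for p in scene_points):
            feasible.append((grasp_pos, approach))
    return feasible
```

A fuller version would check the gripper's full geometry rather than a single segment, and would rank the surviving candidates by the reachability and clearance criteria the abstract mentions.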