GraspView: Active Perception Scoring and Best-View Optimization for Robotic Grasping in Cluttered Environments

Shenglin Wang, Mingtong Dai, Jingxuan Su, Lingbo Liu, Chunjie Chen, Xinyu Wu, Liang Lin

TL;DR

GraspView tackles robust robotic grasping in cluttered environments using an RGB-only pipeline. It combines global 3D scene reconstruction from RGB with VGGT, a render-and-score active perception strategy guided by a vision-language model, and online metric alignment to recover the true scale $\lambda$ between reconstructed geometry and robot kinematics. The method yields improved grasp success over RGB-D and single-view baselines, especially under heavy occlusion, near-field sensing, and transparent object interaction, by enabling occlusion-free viewpoint planning and accurate grasp execution via GraspNet. This approach provides a practical, depth-free alternative for reliable manipulation in unstructured real-world settings, with strong implications for accessible, robust robotic perception and manipulation research.

Abstract

Robotic grasping is a fundamental capability for autonomous manipulation, yet it remains highly challenging in cluttered environments, where occlusion, poor perception quality, and inconsistent 3D reconstructions often lead to unstable or failed grasps. Conventional pipelines rely widely on RGB-D cameras for geometric information, but these sensors fail on transparent or glossy objects and degrade at close range. We present GraspView, an RGB-only robotic grasping pipeline that achieves accurate manipulation in cluttered environments without depth sensors. Our framework integrates three key components: (i) global perception scene reconstruction, which provides locally consistent, up-to-scale geometry from a single RGB view and fuses multi-view projections into a coherent global 3D scene; (ii) a render-and-score active perception strategy, which dynamically selects next-best-views to reveal occluded regions; and (iii) an online metric alignment module that calibrates VGGT predictions against robot kinematics to ensure physical scale consistency. Building on these tailor-designed modules, GraspView performs best-view global grasping, fusing multi-view reconstructions and leveraging GraspNet for robust execution. Experiments on diverse tabletop objects demonstrate that GraspView significantly outperforms both RGB-D and single-view RGB baselines, especially under heavy occlusion, in near-field sensing, and with transparent objects. These results highlight GraspView as a practical and versatile alternative to RGB-D pipelines, enabling reliable grasping in unstructured real-world environments.
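
To make the idea of metric alignment concrete, the following is a minimal sketch of one way a single scale factor between an up-to-scale reconstruction and robot kinematics could be estimated. It is an illustration under stated assumptions, not the paper's procedure: the function name `estimate_scale`, the use of pairwise camera baselines, and the closed-form least-squares fit are all hypothetical choices. It assumes the same camera poses are available twice, once as positions in the reconstruction's arbitrary units and once as positions from forward kinematics in metres.

```python
import math
from itertools import combinations


def estimate_scale(recon_positions, robot_positions):
    """Fit a scalar s minimizing sum over view pairs of (s * d_recon - d_robot)^2.

    recon_positions: camera centers in the reconstruction's arbitrary units.
    robot_positions: the same camera centers from robot kinematics, in metres.
    Both are sequences of (x, y, z) tuples in matching order (an assumption).
    """
    if len(recon_positions) != len(robot_positions):
        raise ValueError("position lists must have the same length")
    if len(recon_positions) < 2:
        raise ValueError("at least two views are needed to form a baseline")

    numerator = 0.0
    denominator = 0.0
    # Pairwise distances do not change under rotation or translation,
    # so the two coordinate frames do not need to be aligned first.
    for i, j in combinations(range(len(recon_positions)), 2):
        d_recon = math.dist(recon_positions[i], recon_positions[j])
        d_robot = math.dist(robot_positions[i], robot_positions[j])
        numerator += d_recon * d_robot
        denominator += d_recon * d_recon

    if denominator == 0.0:
        raise ValueError("all reconstructed camera centers coincide")
    # Closed-form least-squares solution for a single scalar.
    return numerator / denominator


# Hypothetical data: robot baselines are one quarter of the reconstructed ones,
# and the robot frame is shifted relative to the reconstruction frame.
recon = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
robot = [(0.5, 0.1, 0.3), (0.75, 0.1, 0.3), (0.5, 0.6, 0.3)]
scale = estimate_scale(recon, robot)
print(f"estimated scale: {scale:.4f}")
```

Multiplying reconstructed points by the returned factor would express them in metres under these assumptions. Fitting on baselines rather than raw coordinates keeps the sketch independent of how the two frames are oriented; a real system would also need to handle noisy or degenerate camera motion.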

Paper Structure

This paper contains 17 sections, 19 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of our GraspView framework. It comprises three modules: (a) global perception scene construction module, (b) active perception scoring module, (c) best-view global grasping module.
  • Figure 2: Multi-view candidate observations of tabletop objects and the fused occlusion-aware global point cloud reconstruction via VGGT. The integration of complementary views alleviates single-view occlusion, yielding a consistent 3D representation that supports robust object localization and grasp planning.
  • Figure 3: Grasping execution in cluttered tabletop scenes. The robotic arm interacts with diverse objects, including bottles, a kettle, a ruler, and fruits, under partial occlusions. These examples demonstrate the ability of the proposed system to perform reliable grasping in dense and visually complex environments.
  • Figure 4: Experimental setup of grasping objects. The labeled items on the table (kettle, translucent bottle, carambola, banana, mango, tape, ruler, ketchup) are defined as the grasping targets. The remaining items are placed as distractors or obstacles to increase the difficulty of perception and grasp planning.
  • Figure 5: Qualitative results of tabletop grasping under occlusion. The Initial View illustrates the scene as observed by the wrist-mounted camera. RGB-D reconstruction suffers from incomplete or noisy geometry, particularly on transparent or occluded objects. GraspView (w/o Active) denotes the single-view variant, where limited visibility restricts accurate 3D reasoning. In contrast, GraspView (w/ Active) leverages NBV-based active perception to capture previously occluded regions (e.g., ruler), leading to a more complete and metrically consistent 3D reconstruction for grasp planning.