RGBGrasp: Image-based Object Grasping by Capturing Multiple Views during Robot Arm Movement with Neural Radiance Fields

Chang Liu, Kejian Shi, Kaichen Zhou, Haoxiao Wang, Jiyao Zhang, Hao Dong

TL;DR

RGBGrasp tackles robust 3D perception for robotic grasping under limited RGB views by integrating monocular depth priors with neural radiance fields. The method uses an eye-on-hand camera to accumulate views while the arm moves. It introduces a depth rank loss that constrains geometry using the monocular depth priors, and it combines hash encoding with a proposal sampler to accelerate NeRF-based scene reconstruction. A downstream grasp detector then computes 6-DoF grasp poses from the reconstructed point cloud, enabling real-time grasping. Across simulation and real-robot experiments, RGBGrasp demonstrates strong performance on diffuse, transparent, and specular objects, outperforming several RGB- and RGB-D-based baselines while reducing training and inference time.

Abstract

Grasping objects of varied shapes, materials, and textures remains a significant hurdle in robotics research. Unlike many prior works that rely heavily on specialized point-cloud cameras or abundant RGB visual data to gather 3D information for object grasping, this paper introduces RGBGrasp. The method relies on a limited set of RGB views to perceive 3D surroundings containing transparent and specular objects and to achieve accurate grasping. It uses pre-trained depth prediction models to establish geometry constraints, enabling precise 3D structure estimation even under limited-view conditions. In addition, we integrate hash encoding and a proposal sampler strategy to significantly accelerate the 3D reconstruction process. These innovations enhance the adaptability and effectiveness of our algorithm in real-world scenarios. Through comprehensive experimental validation, we demonstrate that RGBGrasp succeeds across a wide spectrum of object-grasping scenarios, establishing it as a promising solution for real-world robotic manipulation tasks. Demonstrations of our method can be found at: https://sites.google.com/view/rgbgrasp
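
As an illustration of how a pre-trained depth model can act as a geometry constraint, the sketch below implements a generic pairwise depth-ranking hinge penalty. This is an assumption made for illustration: the summary names a depth rank loss but does not give its formula, so the function name `depth_rank_loss`, the `margin` parameter, and the hinge form are all hypothetical. The idea is that a monocular depth prediction is trusted only for the ordering of pixel depths, not for their absolute scale.

```python
import numpy as np

def depth_rank_loss(rendered, prior, pairs, margin=1e-4):
    """Pairwise ranking hinge penalty (hypothetical formulation, not the paper's).

    rendered: (N,) depths rendered by the radiance field
    prior:    (N,) depths from a monocular depth network (ordering only)
    pairs:    (M, 2) integer indices of pixel pairs to compare, M >= 1
    """
    rendered = np.asarray(rendered, dtype=float)
    prior = np.asarray(prior, dtype=float)
    i, j = np.asarray(pairs).T
    # +1 if the prior says pixel i is farther than pixel j, -1 if nearer, 0 on ties
    sign = np.sign(prior[i] - prior[j])
    # Penalize pairs whose rendered ordering contradicts the prior ordering
    violation = np.maximum(0.0, margin - sign * (rendered[i] - rendered[j]))
    # Tied prior depths carry no ordering information, so mask them out
    violation = violation * (sign != 0)
    return float(violation.mean())

pairs = [[0, 1], [1, 2], [0, 2]]
prior = [10.0, 20.0, 30.0]                        # arbitrary scale, only the order matters
agree = depth_rank_loss([1.0, 2.0, 3.0], prior, pairs)
disagree = depth_rank_loss([3.0, 2.0, 1.0], prior, pairs)
```

Because only the sign of each depth difference enters the penalty, a rendering that preserves the prior's ordering incurs no cost regardless of the prior's scale, while a reversed ordering is penalized.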

Paper Structure (30 sections, 4 equations, 10 figures, 5 tables)

This paper contains 30 sections, 4 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Overview of RGBGrasp. We introduce a novel approach capable of reconstructing the 3D geometric information of a target scene using views acquired during standard grasping procedures. Our method is not limited to fixed viewpoints and can flexibly work with partial observations in different trajectories based on the environmental requirements.
  • Figure 2: RGBGrasp Pipeline. The robot captures multiple views along its approach trajectory toward the objects and uses them to build a multi-scale hash table. A proposal sampler is then trained to refine the sampling positions for a subsequent fine predictor. This predictor outputs color and density for individual points, and the density is used to construct the final point cloud. The resulting point cloud serves as input to a pre-trained grasping module that predicts a 6-DoF grasp pose. Throughout optimization, the monocular depth network and the grasping module are kept frozen, whereas the hash table, proposal sampler, and NeRF MLP are updated.
  • Figure 3: Comparison Between GraspNeRF and RGBGrasp (Ours) in Terms of Reconstructed Point Clouds. This figure presents a comparative analysis of point cloud reconstructions obtained using RGBGrasp and GraspNeRF under different trajectories, specifically, $90^\circ$ and $180^\circ$. The upper row provides a comparison in the "Packed" scenario, while the lower row presents the comparison in the "Pile" scenario.
  • Figure 4: Visualization of Trajectories with Different View Ranges. Only the areas covered by the captured views are shown. In the original configuration (a), views are evenly distributed over a full 360° at a constant height. We successively reduce the view range to 270°, 180°, and 90°, resulting in the trajectories shown in (b), (c), and (d).
  • Figure 5: Visualization of Ablation Studies. This figure illustrates a comparative analysis between RGBGrasp and the ablation version of our method regarding the quality of 3D geometry reconstruction.
  • ...and 5 more figures
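
The view-range setup described in the Figure 4 caption can be sketched as evenly spaced camera positions on a circular arc at constant height. This is a minimal illustration, not the authors' code: the function name `arc_viewpoints`, the `radius` and `height` defaults, and the choice to center the arc on the scene origin are all assumptions.

```python
import numpy as np

def arc_viewpoints(view_range_deg, n_views, radius=0.5, height=0.4):
    """Camera positions evenly spaced on an arc around the origin.

    radius and height are hypothetical placeholder values in meters.
    Returns an (n_views, 3) array of xyz positions.
    """
    if view_range_deg >= 360.0:
        # Full circle: drop the endpoint so the first and last views differ
        angles = np.linspace(0.0, 2.0 * np.pi, n_views, endpoint=False)
    else:
        # Partial arc: include both ends of the range
        angles = np.linspace(0.0, np.deg2rad(view_range_deg), n_views)
    x = radius * np.cos(angles)
    y = radius * np.sin(angles)
    z = np.full(n_views, height)  # constant camera height
    return np.stack([x, y, z], axis=1)

# One trajectory per view range listed in the caption
trajectories = {deg: arc_viewpoints(deg, n_views=8) for deg in (360, 270, 180, 90)}
quarter = arc_viewpoints(90, n_views=3)
```

Shrinking `view_range_deg` keeps the number of views fixed while concentrating them on a smaller portion of the scene, which is the partial-observation condition the caption describes.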