You Only Scan Once: A Dynamic Scene Reconstruction Pipeline for 6-DoF Robotic Grasping of Novel Objects

Lei Zhou; Haozhe Wang; Zhengshen Zhang; Zhiyang Liu; Francis EH Tay; and Marcelo H. Ang.

TL;DR

The paper tackles robust 6-DoF robotic grasping in dynamic environments by addressing occlusion and static-scene limitations with a two-stage dynamic scene reconstruction pipeline called You Only Scan Once (YOSO). Stage I performs a single RGB-D scan to register novel objects and generate meshes, while Stage II tracks object poses to reinsert these meshes into the scene, producing a complete, up-to-date scene point cloud for grasp planning. The method combines a Video-segmentation module (XMem), a 6D Pose Tracker with a NeRF-based mesh generator (BundleSDF) to build object geometry, and a 6-DoF Grasp Pose Predictor (Scale-balanced GraspNet) to predict grasps on the reconstructed scene, achieving significant gains on GraspNet-1Billion and extending the dataset with fully visible scenes. Results show near real-time performance and substantial improvements over partial-point-cloud baselines, suggesting practical impact for real-world robotic grasping in dynamic settings.

Abstract

In robotic grasping, achieving accurate and reliable interactions with the environment is a pivotal challenge. Traditional grasp planning methods that use partial point clouds derived from depth images often suffer from reduced scene understanding due to occlusion, which ultimately impedes their grasping accuracy. Furthermore, scene reconstruction methods have primarily relied on static techniques, which are susceptible to environment changes during the manipulation process; this limits their efficacy in real-time grasping tasks. To address these limitations, this paper introduces a novel two-stage pipeline for dynamic scene reconstruction. In the first stage, our approach takes a scene scan as input and registers each target object through mesh reconstruction and novel object pose tracking. In the second stage, pose tracking continues to provide object poses in real time, enabling our approach to transform the reconstructed object point clouds back into the scene. Unlike conventional methodologies, which rely on static scene snapshots, our method continuously captures the evolving scene geometry, resulting in a comprehensive and up-to-date point cloud representation. By circumventing the constraints posed by occlusion, our method enhances the overall grasp planning process and enables state-of-the-art 6-DoF robotic grasping algorithms to achieve markedly improved accuracy.
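The second-stage step described above, transforming reconstructed object point clouds back into the scene with tracked poses, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `transform_points` and `compose_scene`, the (N, 3) array layout, and the 4x4 object-to-camera pose convention are assumptions made for illustration only.

```python
import numpy as np


def transform_points(points, pose):
    """Apply a 4x4 homogeneous pose (assumed object-to-camera) to an (N, 3) array."""
    rotation = pose[:3, :3]
    translation = pose[:3, 3]
    # Row-vector form of R @ p + t for every point p.
    return points @ rotation.T + translation


def compose_scene(partial_cloud, object_clouds, object_poses):
    """Stack the observed partial cloud with each object cloud placed at its pose."""
    placed = [
        transform_points(cloud, pose)
        for cloud, pose in zip(object_clouds, object_poses)
    ]
    return np.vstack([partial_cloud, *placed])


if __name__ == "__main__":
    # Hypothetical data: one observed point and one single-point object model.
    partial = np.array([[0.0, 0.0, 1.0]])
    model = np.array([[1.0, 0.0, 0.0]])
    pose = np.eye(4)
    pose[:3, :3] = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 90 degrees about z
    pose[:3, 3] = [1.0, 0.0, 0.0]
    print(compose_scene(partial, [model], [pose]))
```

In this sketch the merged cloud is a plain concatenation; a practical system would likely also remove duplicate or occluded points before passing the cloud to a grasp predictor.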

Paper Structure (21 sections, 3 equations, 4 figures, 3 tables)

INTRODUCTION
RELATED WORKS
Grasping Methods Utilizing Partial Point Clouds
Grasping Methods Utilizing Single-view Shape Completion
Static Scene Reconstruction Methods
TSDF-based Methods
NeRF-based Methods
METHOD
Video-segmentation Module
6D Object Pose Tracker and Mesh Generator
6D Pose Tracker for Novel Object
NeRF-based Mesh Generator
6-DoF Grasp Pose Predictor
EXPERIMENTS
Benchmark and Metric
...and 6 more sections

Figures (4)

Figure 1: Dynamic scene reconstruction and grasp generation. (a) RGB-D images are captured by an RGB-D camera as it scans the grasping workspace. (b) A Video-segmentation Module segments the graspable objects in the scene. (c) Using the RGB-D images and masks from (a) and (b), we reconstruct the meshes of the graspable objects and merge them with the original partial point cloud to create a full point cloud of the workspace. (d) Finally, a Grasp Pose Predictor is used to generate the valid grasps based on the reconstructed full point cloud.
Figure 2: Overview of the proposed pipeline. Stage I: Given a monocular RGB-D video, object masks are segmented using a Video-segmentation Module. Subsequently, feature matching is performed in the Object Pose Tracker and Mesh Generator module to simultaneously track object pose and reconstruct object mesh. Keyframes with informative historical observations are stored in the memory pool to facilitate pose tracking in both stages. Stage II: In testing, given an RGB-D image, the masks of the objects in the workspace are segmented out and the object pose is estimated by taking the Keyframe Memory Pool as a reference. Subsequently, the reconstructed meshes are transformed into camera coordinates with the estimated object pose. Taking this reconstructed scene point cloud, grasp generation is performed to generate the top k grasp poses for real-world experiments. The dotted lines represent the supplementation of historical information.
Figure 3: Configuration of real-world experiment.
Figure 4: Qualitative comparison of grasp prediction with partial point cloud and reconstructed scene on the GraspNet-1Billion dataset. Color varies from red to blue to represent grasp quality from high to low. (a) Partial point cloud back-projected from the depth image. (b) Grasps generated on the partial point cloud. (c) Reconstructed scene from the YOSO pipeline. (d) Grasps generated on the reconstructed scene. (e) Complete scene-level point cloud. (f) Grasps generated on the complete scene-level point cloud.
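The partial point cloud in Figure 4 is obtained by back-projecting a depth image. A minimal sketch of that operation under a standard pinhole camera model is given below; the function name `backproject_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the convention that a depth of zero marks an invalid pixel are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map to an (N, 3) camera-frame point cloud.

    Assumes a pinhole model and that depth <= 0 marks a missing measurement.
    """
    height, width = depth.shape
    # u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


if __name__ == "__main__":
    # Hypothetical 2x2 depth map with two valid pixels.
    depth = np.array([[0.0, 2.0], [1.0, 0.0]])
    print(backproject_depth(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0))
```

Only surfaces visible from the camera produce points here, which is why grasps planned on such a cloud are affected by occlusion.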
