You Only Scan Once: A Dynamic Scene Reconstruction Pipeline for 6-DoF Robotic Grasping of Novel Objects
Lei Zhou, Haozhe Wang, Zhengshen Zhang, Zhiyang Liu, Francis EH Tay, and Marcelo H. Ang.
TL;DR
The paper tackles robust 6-DoF robotic grasping in dynamic environments, where occlusion and static-scene assumptions limit existing methods, with a two-stage dynamic scene reconstruction pipeline called You Only Scan Once (YOSO). Stage I performs a single RGB-D scan to register novel objects and generate their meshes, while Stage II tracks object poses to reinsert these meshes into the scene, producing a complete, up-to-date scene point cloud for grasp planning. The method combines a video segmentation module (XMem), a 6D pose tracker with a NeRF-based mesh generator (BundleSDF) that builds object geometry, and a 6-DoF grasp pose predictor (Scale-balanced GraspNet) that predicts grasps on the reconstructed scene. It achieves significant gains on GraspNet-1Billion, and the authors extend the dataset with fully visible scenes. Results show near real-time performance and substantial improvements over partial-point-cloud baselines, suggesting practical impact for real-world robotic grasping in dynamic settings.
Abstract
In the realm of robotic grasping, achieving accurate and reliable interactions with the environment is a pivotal challenge. Traditional grasp planning methods that use partial point clouds derived from depth images often suffer from reduced scene understanding due to occlusion, which ultimately impedes their grasping accuracy. Furthermore, scene reconstruction methods have primarily relied on static techniques, which are susceptible to environment changes during manipulation; this limits their efficacy in real-time grasping tasks. To address these limitations, this paper introduces a novel two-stage pipeline for dynamic scene reconstruction. In the first stage, our approach takes a scene scan as input and registers each target object through mesh reconstruction and novel object pose tracking. In the second stage, pose tracking continues to provide object poses in real time, enabling our approach to transform the reconstructed object point clouds back into the scene. Unlike conventional methodologies, which rely on static scene snapshots, our method continuously captures the evolving scene geometry, resulting in a comprehensive and up-to-date point cloud representation. By circumventing the constraints posed by occlusion, our method enhances the overall grasp planning process and empowers state-of-the-art 6-DoF robotic grasping algorithms to exhibit markedly improved accuracy.
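The second-stage step of transforming reconstructed object point clouds back into the scene can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each registered object is stored as an (N, 3) point array in its own object frame and that the tracker returns a 4x4 homogeneous object-to-camera pose per object. The function names `transform_points` and `compose_scene` are hypothetical and chosen only for illustration.

```python
import numpy as np


def transform_points(points, pose):
    """Apply a 4x4 rigid transform to an (N, 3) array of points.

    Assumption: `pose` maps object-frame coordinates to camera-frame
    coordinates (hypothetical convention for this sketch).
    """
    rotation = pose[:3, :3]
    translation = pose[:3, 3]
    # Row-vector form of R @ p + t for every point p.
    return points @ rotation.T + translation


def compose_scene(object_clouds, object_poses):
    """Merge per-object clouds into one scene cloud using tracked poses.

    Both arguments are hypothetical inputs: lists of equal length holding
    the registered object clouds and their current tracked poses.
    """
    parts = [transform_points(cloud, pose)
             for cloud, pose in zip(object_clouds, object_poses)]
    return np.vstack(parts)


# Toy usage: one point on the x-axis, rotated 90 degrees about z and
# lifted one unit along z.
cloud = np.array([[1.0, 0.0, 0.0]])
pose = np.array([[0.0, -1.0, 0.0, 0.0],
                 [1.0,  0.0, 0.0, 0.0],
                 [0.0,  0.0, 1.0, 1.0],
                 [0.0,  0.0, 0.0, 1.0]])
scene = compose_scene([cloud], [pose])
```

Because the object geometry is fixed after registration, only the poses need to be refreshed per frame in a sketch like this, which is one plausible reason a pipeline of this shape can keep the scene cloud up to date as objects move.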
