SceneComplete: Open-World 3D Scene Completion in Cluttered Real World Environments for Robot Manipulation
Aditya Agarwal, Gaurav Singh, Bipasha Sen, Tomás Lozano-Pérez, Leslie Pack Kaelbling
TL;DR
This work tackles open-world 3D scene completion from a single RGB-D image to aid robot manipulation. It introduces SceneComplete, a modular pipeline that assembles open-domain perception modules (vision-language prompting, segmentation, image inpainting, image-to-3D mesh generation, dense-descriptor-based scaling, and 6-DoF pose estimation) with minimal task-specific training. The system outputs complete per-object meshes registered to the observed scene, enabling robust grasping and dexterous manipulation in clutter. Across GraspNet-1B, YCB-Video, and real-robot trials, SceneComplete achieves higher reconstruction fidelity and grasp success than baselines, demonstrating practical impact for real-world manipulation in open-world environments.
Abstract
Careful robot manipulation in everyday cluttered environments requires an accurate understanding of the 3D scene, both to grasp and place objects stably and reliably and to avoid colliding with other objects. In general, we must construct such a 3D interpretation of a complex scene from limited input, such as a single RGB-D image. We describe SceneComplete, a system for constructing a complete, segmented 3D model of a scene from a single view. SceneComplete is a novel pipeline that composes general-purpose pretrained perception modules (vision-language, segmentation, image inpainting, image-to-3D, visual descriptors, and pose estimation) to obtain highly accurate results. We demonstrate its accuracy and effectiveness with respect to ground-truth models on a large benchmark dataset and show that its accurate whole-object reconstruction enables robust grasp proposal generation, including for a dexterous hand. We release the code and additional results on our website.
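
To make the scaling step more concrete: an image-to-3D model typically produces a mesh in arbitrary units, so it must be rescaled before it can be registered to the metric RGB-D observation. The sketch below is one plausible illustration, not the method used in SceneComplete. It assumes that descriptor matching has already produced 3D point correspondences between the generated mesh and the observed partial point cloud. The function name `estimate_scale` and the median-of-distance-ratios rule are hypothetical choices made for this example.

```python
import math
from itertools import combinations
from statistics import median
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def estimate_scale(mesh_pts: Sequence[Point], scene_pts: Sequence[Point]) -> float:
    """Estimate a scalar s so that s * (mesh distances) matches scene distances.

    mesh_pts[i] and scene_pts[i] are assumed to be corresponding 3D points
    (hypothetically obtained from dense-descriptor matching). Pairwise
    distances are invariant to rotation and translation, so their ratio
    isolates the scale. The median makes the estimate tolerant of a few
    bad correspondences.
    """
    if len(mesh_pts) != len(scene_pts):
        raise ValueError("point lists must have the same length")
    if len(mesh_pts) < 2:
        raise ValueError("need at least two correspondences")

    ratios = []
    for i, j in combinations(range(len(mesh_pts)), 2):
        d_mesh = math.dist(mesh_pts[i], mesh_pts[j])
        d_scene = math.dist(scene_pts[i], scene_pts[j])
        if d_mesh > 1e-9:  # skip degenerate (coincident) mesh points
            ratios.append(d_scene / d_mesh)

    if not ratios:
        raise ValueError("all mesh points coincide; scale is undefined")
    return median(ratios)


# Usage: a unit-sized mesh observed at one quarter scale, shifted by (1, 2, 3).
mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
scene = [(1.0, 2.0, 3.0), (1.25, 2.0, 3.0), (1.0, 2.25, 3.0), (1.0, 2.0, 3.25)]
scale = estimate_scale(mesh, scene)  # approximately 0.25
```

Using distance ratios rather than raw coordinates means the scale can be estimated before the 6-DoF pose is known, which matches the ordering of the modules listed in the TL;DR (scaling, then pose estimation).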
