SceneComplete: Open-World 3D Scene Completion in Cluttered Real World Environments for Robot Manipulation

Aditya Agarwal, Gaurav Singh, Bipasha Sen, Tomás Lozano-Pérez, Leslie Pack Kaelbling

TL;DR

This work tackles open-world 3D scene completion from a single RGB-D image to aid robot manipulation. It introduces SceneComplete, a modular pipeline that assembles open-domain perception modules (vision-language prompting, segmentation, image inpainting, image-to-3D mesh generation, dense-descriptor-based scaling, and 6DOF pose estimation) with minimal task-specific training. The system outputs complete, per-object meshes registered to the observed scene, enabling robust grasping and dexterous manipulation in clutter. Across GraspNet-1B, YCB-Video, and real-robot trials, SceneComplete achieves higher reconstruction fidelity and grasp success than baselines, demonstrating practical impact for real-world manipulation in open-world environments.

Abstract

Careful robot manipulation in everyday cluttered environments requires an accurate understanding of the 3D scene, in order to grasp and place objects stably and reliably and to avoid colliding with other objects. In general, we must construct such a 3D interpretation of a complex scene based on limited input, such as a single RGB-D image. We describe SceneComplete, a system for constructing a complete, segmented 3D model of a scene from a single view. SceneComplete is a novel pipeline for composing general-purpose pretrained perception modules (vision-language, segmentation, image-inpainting, image-to-3D, visual-descriptors, and pose-estimation) to obtain highly accurate results. We demonstrate its accuracy and effectiveness with respect to ground-truth models in a large benchmark dataset and show that its accurate whole-object reconstruction enables robust grasp proposal generation, including for a dexterous hand. We release the code and additional results on our website.

Paper Structure

This paper contains 15 sections, 1 equation, 8 figures, 3 tables.

Figures (8)

  • Figure 1: SceneComplete (a) takes as input a single RGB-D image of a given scene, visualized here as a point cloud; (b) produces high-quality, fully completed, accurately segmented object meshes in scenes with substantial occlusion and novel objects; and (c) enables downstream dexterous manipulation that requires accurate, complete shape information.
  • Figure 2: Overview of the SceneComplete pipeline. Starting from a single RGB-D input, the system produces a set of object meshes registered with the input 3D scan, yielding a complete 3D scene reconstruction. The pipeline consists of six key phases: (1) an RGB image is fed into a VLM to enumerate and describe objects, (2) object descriptions and the RGB image are processed by a grounded segmentation model to generate object masks, (3) occluded regions are completed via an image inpainting model adapted to output single, fully observable objects on a white background, (4) the inpainted 2D images are passed into an image-to-3D model to produce object meshes, (5) object meshes are scaled according to the segmented partial point cloud, and (6) mesh poses are adjusted within the 3D coordinate frame of the original scan using 6DOF pose estimation. Each step leverages pre-trained open-world large vision models, enabling scalability and benefiting from future model improvements.
  • Figure 3: In the image inpainting module, occluded objects (blue borders) are transformed into single fully observable objects.
  • Figure 4: (a) The impact of inpainting on image-to-3D reconstruction. Without inpainting (top), the image-to-3D model generates incomplete meshes. Inpainting (bottom) fills in occluded parts, producing accurate 3D reconstructions. (b) Comparison of inpainting models. Unadapted BrushNet (middle) introduces artifacts, while the adapted version (right) inpaints occluded parts correctly, producing a fully observed object.
  • Figure 5: Qualitative comparisons of scene reconstructions on the GraspNet-1B dataset. For each scene, we show the input RGB-D image, the OctMAE reconstruction (rendered as normal maps, as it predicts scene-level occupancy values), the ZeroGrasp reconstruction (rendered as normal maps), our reconstruction (visualized as individually reconstructed object meshes color-matched to the ground truth), and the ground-truth object meshes. Highlighted regions indicate missing areas (black) or spurious regions connecting distinct objects (red).
  • ...and 3 more figures
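
Phase (5) in the Figure 2 caption scales each generated mesh to the segmented partial point cloud. The sketch below shows one simple way such a scale could be estimated. It is not the authors' implementation. It assumes matched point pairs between the mesh and the scan are already available (for example, from dense visual descriptors), and the names `estimate_scale` and `scale_about_centroid` are hypothetical. The sketch relies on the fact that ratios of pairwise distances are unchanged by rotation and translation, so a uniform scale can be recovered before the pose is known, which fits the order of phases (5) and (6). Taking the median limits the influence of a few bad matches.

```python
import math
from itertools import combinations
from statistics import median
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def estimate_scale(mesh_pts: Sequence[Point], scan_pts: Sequence[Point]) -> float:
    """Estimate the uniform scale that maps mesh_pts onto scan_pts.

    mesh_pts[i] and scan_pts[i] are assumed to be matched points
    (hypothetical correspondences; how they are found is out of scope here).
    """
    if len(mesh_pts) != len(scan_pts) or len(mesh_pts) < 2:
        raise ValueError("need at least two matched point pairs")

    ratios = []
    for i, j in combinations(range(len(mesh_pts)), 2):
        d_mesh = math.dist(mesh_pts[i], mesh_pts[j])
        d_scan = math.dist(scan_pts[i], scan_pts[j])
        if d_mesh > 1e-9:  # skip pairs of (nearly) coincident mesh points
            ratios.append(d_scan / d_mesh)

    if not ratios:
        raise ValueError("all mesh points coincide")
    # The median is robust to a minority of wrong correspondences.
    return median(ratios)


def scale_about_centroid(vertices: Sequence[Point], scale: float) -> List[Point]:
    """Scale mesh vertices uniformly about their centroid."""
    n = len(vertices)
    cx = sum(v[0] for v in vertices) / n
    cy = sum(v[1] for v in vertices) / n
    cz = sum(v[2] for v in vertices) / n
    return [
        (cx + scale * (x - cx), cy + scale * (y - cy), cz + scale * (z - cz))
        for x, y, z in vertices
    ]
```

As a usage example, `estimate_scale(mesh_pts, scan_pts)` returns a single factor that can be passed to `scale_about_centroid(mesh_vertices, factor)`. The scaled mesh would then be handed to a separate pose-estimation step, which this sketch does not cover.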