Zero-Shot Multi-Object Scene Completion

Shun Iwase, Katherine Liu, Vitor Guizilini, Adrien Gaidon, Kris Kitani, Rares Ambrus, Sergey Zakharov

TL;DR

This work tackles zero-shot multi-object scene completion from a single RGB-D image by introducing OctMAE, a hybrid architecture that combines an Octree U-Net with a latent 3D Masked Autoencoder (MAE). Its key innovations are an occlusion-aware masking strategy and 3D rotary embeddings (RoPE), which make full attention in the latent space efficient and address the memory and scaling problems typical of 3D MAEs. A large-scale synthetic dataset, built from 12K Objaverse/GSO models and rendered with BlenderProc, supports generalization to real-world scenes without object-specific priors. The method reports state-of-the-art completion quality on both synthetic and real datasets with near real-time inference, and its ablations isolate the contributions of the latent MAE, RoPE, and occlusion masking. The results are relevant to robotic manipulation and planning in cluttered environments.

Abstract

We present a 3D scene completion method that recovers the complete geometry of multiple unseen objects in complex scenes from a single RGB-D image. Despite notable advancements in single-object 3D shape completion, high-quality reconstruction in highly cluttered real-world multi-object scenes remains a challenge. To address this issue, we propose OctMAE, an architecture that leverages an Octree U-Net and a latent 3D MAE to achieve high-quality and near real-time multi-object scene completion through both local and global geometric reasoning. Because a naive 3D MAE can be computationally intractable and memory intensive even in the latent space, we introduce a novel occlusion masking strategy and adopt 3D rotary embeddings, which significantly improve the runtime and scene completion quality. To generalize to a wide range of objects in diverse scenes, we create a large-scale photorealistic dataset featuring a diverse set of 12K 3D object models from the Objaverse dataset, which are rendered in multi-object scenes with physics-based positioning. Our method outperforms the current state-of-the-art on both synthetic and real-world datasets and demonstrates a strong zero-shot capability.

Paper Structure (44 sections, 9 equations, 8 figures, 8 tables)

This paper contains 44 sections, 9 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Given an RGB-D image and the foreground mask of multiple objects not seen during training, our method predicts their complete 3D shapes quickly and accurately, including occluded areas. (Left) Synthetic image results. (Right) Zero-shot generalization to a real-world image of household objects with noisy depth data. Our 3D results are rotated with respect to the input to highlight completions in occluded regions.
  • Figure 2: Overview of our proposed method (OctMAE). Given an input RGB image $\mathbf{I}$, a depth map $\mathbf{D}$, and a foreground mask $\mathbf{M}$, the octree feature $\mathbf{F}$ is obtained by unprojecting an image feature encoded by a pre-trained image encoder $\mathbf{E}$. The octree feature is then encoded by the Octree encoder and downsampled to a Level of Detail (LoD) of $5$. The notation LoD-$h$ indicates that each axis of the voxel grid has a resolution of $2^h$. The latent 3D MAE takes the encoded octree feature $\mathbf{F}$ as input, and its output feature is concatenated with the occlusion mask tokens $\mathbf{T}$. Next, the masked decoded feature $\mathbf{F}_{ML}$ is computed by the sparse 3D MAE decoder. Finally, the Octree decoder predicts a completed surface at LoD-$9$.
  • Figure 3: Example images from our synthetic dataset. We use BlenderProc (Denninger et al., 2023) to acquire high-quality images under varied, realistic illumination conditions.
  • Figure 4: Overall architecture of Latent 3D MAE.
  • Figure 5: Scaling of the metrics with the number of objects in the training dataset. We run the experiments with $1$%, $5$%, $10$%, $20$%, $40$%, $60$%, $80$%, and $100$% of the objects.
  • ...and 3 more figures
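
The Figure 2 caption describes unprojecting image features with the depth map and foreground mask, and defines LoD-$h$ as a voxel grid with $2^h$ cells per axis. The following is a minimal geometric sketch of that step, not the authors' implementation. It assumes a pinhole camera with intrinsics `K` and a cubic working volume given by the hypothetical parameters `scene_min` and `scene_size`, and it handles only point positions, whereas the paper attaches encoded image features to the points.

```python
import numpy as np

def unproject_to_voxels(depth, mask, K, lod, scene_min, scene_size):
    """Unproject masked depth pixels to 3D and quantize them at LoD-`lod`.

    depth: (H, W) metric depth; mask: (H, W) boolean foreground mask;
    K: (3, 3) pinhole intrinsics. `scene_min` (3,) and `scene_size` (scalar)
    describe an assumed axis-aligned cubic volume; both are hypothetical.
    """
    # Pixel coordinates (row v, column u) of valid foreground pixels.
    v, u = np.nonzero(mask & (depth > 0))
    z = depth[v, u]
    # Standard pinhole back-projection into the camera frame.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    points = np.stack([x, y, z], axis=1)

    res = 2 ** lod  # LoD-h has 2^h cells along each axis
    normalized = (points - scene_min) / scene_size
    inside = np.all((normalized >= 0) & (normalized < 1), axis=1)
    voxels = np.floor(normalized[inside] * res).astype(np.int64)
    # Several pixels can fall into one cell, so keep unique cell indices.
    return points[inside], np.unique(voxels, axis=0)

if __name__ == "__main__":
    depth = np.ones((4, 4))
    mask = np.zeros((4, 4), dtype=bool)
    mask[2, 2] = mask[2, 3] = True
    K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
    for lod in (5, 9):
        _, vox = unproject_to_voxels(
            depth, mask, K, lod, np.array([-1.0, -1.0, 0.0]), 2.0
        )
        print(f"LoD-{lod}: grid {2 ** lod}^3, occupied cells {vox.tolist()}")
```

At LoD-5 the grid has 32 cells per axis and at LoD-9 it has 512, so the same two points occupy coarse cells in the latent stage and much finer cells at the output resolution.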