Zero-Shot Multi-Object Scene Completion
Shun Iwase, Katherine Liu, Vitor Guizilini, Adrien Gaidon, Kris Kitani, Rares Ambrus, Sergey Zakharov
TL;DR
This work tackles zero-shot multi-object scene completion from a single RGB-D image by introducing OctMAE, a hybrid architecture that combines an Octree U-Net with a latent 3D Masked Autoencoder (MAE). Its key components are an occlusion-aware masking strategy and 3D rotary position embeddings (RoPE), which make full attention in the latent space efficient and address the memory and scale problems typical of 3D MAEs. A large-scale synthetic dataset, built from 12K Objaverse/GSO models and rendered with BlenderProc, supports generalization to real-world scenes without object-specific priors. The method reports state-of-the-art completion quality on both synthetic and real datasets at near real-time speed, and ablations isolate the contributions of the latent MAE, RoPE, and occlusion masking. The results are relevant to robotic manipulation and planning in cluttered environments.
Abstract
We present a 3D scene completion method that recovers the complete geometry of multiple unseen objects in complex scenes from a single RGB-D image. Despite notable advances in single-object 3D shape completion, high-quality reconstruction in highly cluttered real-world multi-object scenes remains a challenge. To address this issue, we propose OctMAE, an architecture that leverages an Octree U-Net and a latent 3D MAE to achieve high-quality, near real-time multi-object scene completion through both local and global geometric reasoning. Because a naive 3D MAE can be computationally intractable and memory intensive even in the latent space, we introduce a novel occlusion masking strategy and adopt 3D rotary embeddings, which together significantly improve runtime and scene completion quality. To generalize to a wide range of objects in diverse scenes, we create a large-scale photorealistic dataset featuring a diverse set of 12K 3D object models from the Objaverse dataset, rendered in multi-object scenes with physics-based positioning. Our method outperforms the current state of the art on both synthetic and real-world datasets and demonstrates strong zero-shot capability.
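To make the idea of 3D rotary embeddings concrete, the following is a minimal NumPy sketch of one common axis-wise formulation, not the OctMAE implementation. The function names `rope_1d` and `rope_3d`, the equal three-way channel split, and the frequency base of 10000 are assumptions chosen for illustration. The sketch shows the property that makes rotary embeddings attractive for attention over voxel tokens: attention scores depend only on the relative offset between token coordinates.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate channel pairs of x (N, D) by angles proportional to pos (N,).

    D must be even. Channel i is paired with channel i + D/2.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # (half,) per-pair frequencies
    angles = pos[:, None] * freqs[None, :]      # (N, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_3d(x, coords, base=10000.0):
    """Axis-wise 3D rotary embedding (illustrative assumption).

    x:      (N, D) query or key features, with D divisible by 6.
    coords: (N, 3) voxel coordinates (x, y, z) of each token.
    The channels are split into three equal chunks, and each chunk is
    rotated according to one coordinate axis.
    """
    n, d = x.shape
    assert d % 6 == 0, "need an even number of channels per axis"
    chunk = d // 3
    parts = [
        rope_1d(x[:, i * chunk:(i + 1) * chunk],
                coords[:, i].astype(np.float64), base)
        for i in range(3)
    ]
    return np.concatenate(parts, axis=-1)

# Usage: attention logits are unchanged when every token is shifted
# by the same offset, i.e. they depend only on relative position.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 12))
k = rng.normal(size=(5, 12))
coords = rng.integers(0, 16, size=(5, 3))
shift = np.array([3, -2, 7])

scores = rope_3d(q, coords) @ rope_3d(k, coords).T
scores_shifted = rope_3d(q, coords + shift) @ rope_3d(k, coords + shift).T
print(np.allclose(scores, scores_shifted))
```

Because the rotation is applied to queries and keys rather than added to the inputs, no positional table has to be stored per voxel, which is one reason this style of encoding suits sparse 3D token sets.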
