Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly
Junsheng Zhou, Yu-Shen Liu, Zhizhong Han
TL;DR
The paper addresses reconstructing 3D scenes from a single image without task-specific training. It introduces deep prior assembly, a zero-shot framework that assembles priors from frozen models (Grounded-SAM, Stable Diffusion, Open-CLIP, Shap·E, Omnidata) and optimizes 7-DoF pose and scale using both 2D and 3D cues, with a RANSAC-like strategy for robustness. It reports superior performance on synthetic (3D-Front), open-world (Replica/BlendSwap), and real-world (ScanNet) datasets, with ablations illustrating the value of each component. The approach enables flexible, open-world 3D scene reconstruction from a single image without requiring any 3D or 2D data-driven training, with practical applications in AR/VR and robotics.
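To make the idea of fitting pose and scale under a hypothesize-and-verify strategy concrete, here is a minimal Python sketch. It is not the authors' procedure: it is a simplified stand-in that searches only a yaw rotation (not a full 3D rotation), sets scale and translation in closed form from point-cloud statistics, and keeps the candidate with the lowest Chamfer distance. The function names (chamfer, yaw_matrix, fit_pose_scale), the choice of a y-up axis, and the number of trials are all assumptions made for illustration.

```python
import numpy as np


def chamfer(a, b):
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()


def yaw_matrix(theta):
    """Rotation by theta radians about the y axis (assumed to be 'up' here)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def fit_pose_scale(shape_pts, target_pts, n_trials=36):
    """Illustrative multi-hypothesis fit (hypothetical helper, not the paper's code).

    Each hypothesis is one yaw angle. Scale comes from the ratio of mean
    distances to the centroid, translation from matching centroids. The
    hypothesis with the lowest Chamfer distance to the target is returned
    as (error, theta, scale, translation).
    """
    src_mean = shape_pts.mean(axis=0)
    src_centered = shape_pts - src_mean
    tgt_mean = target_pts.mean(axis=0)
    src_spread = np.linalg.norm(src_centered, axis=1).mean()
    tgt_spread = np.linalg.norm(target_pts - tgt_mean, axis=1).mean()
    scale = tgt_spread / src_spread

    best = None
    for theta in np.linspace(0.0, 2.0 * np.pi, n_trials, endpoint=False):
        R = yaw_matrix(theta)
        moved = scale * src_centered @ R.T + tgt_mean
        err = chamfer(moved, target_pts)
        if best is None or err < best[0]:
            translation = tgt_mean - scale * R @ src_mean
            best = (err, theta, scale, translation)
    return best
```

In the actual framework the candidates would be scored against 2D and 3D cues taken from the image; this sketch substitutes a single point-cloud target to keep the example self-contained.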
Abstract
Large language and vision models have been leading a revolution in visual computing. By greatly scaling up the size of data and model parameters, these large models learn deep priors that lead to remarkable performance in various tasks. In this work, we present deep prior assembly, a novel framework that assembles diverse deep priors from large models for scene reconstruction from single images in a zero-shot manner. We show that this challenging task can be done without extra knowledge, simply by generalizing one deep prior to each sub-task. To this end, we introduce novel methods related to poses, scales, and occlusion parsing, which are key to enabling deep priors to work together in a robust way. Deep prior assembly does not require any 3D or 2D data-driven training in the task and demonstrates superior performance in generalizing priors to open-world scenes. We conduct evaluations on various datasets, and report analysis as well as numerical and visual comparisons with the latest methods to show our superiority. Project page: https://junshengzhou.github.io/DeepPriorAssembly.
