Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly

Junsheng Zhou, Yu-Shen Liu, Zhizhong Han

TL;DR

The paper addresses reconstructing 3D scenes from a single image without training data. It introduces deep prior assembly, a zero-shot framework that stacks priors from frozen models (Grounded-SAM, Stable Diffusion, Open-CLIP, Shap•E, Omnidata) and optimizes 7-DoF pose/scale using both 2D and 3D cues, plus a robust RANSAC-like strategy. It demonstrates superior performance across synthetic (3D-Front), open-world (Replica/BlendSwap), and real-world (ScanNet) datasets, with thorough ablations illustrating the value of each component. This approach enables flexible, open-world 3D scene reconstruction from a single image without requiring 3D or 2D data-driven training, advancing practical applications in AR/VR and robotics.

Abstract

Large language and vision models have been leading a revolution in visual computing. By greatly scaling up sizes of data and model parameters, the large models learn deep priors which lead to remarkable performance in various tasks. In this work, we present deep prior assembly, a novel framework that assembles diverse deep priors from large models for scene reconstruction from single images in a zero-shot manner. We show that this challenging task can be done without extra knowledge but just simply generalizing one deep prior in one sub-task. To this end, we introduce novel methods related to poses, scales, and occlusion parsing which are keys to enable deep priors to work together in a robust way. Deep prior assembly does not require any 3D or 2D data-driven training in the task and demonstrates superior performance in generalizing priors to open-world scenes. We conduct evaluations on various datasets, and report analysis, numerical and visual comparisons with the latest methods to show our superiority. Project page: https://junshengzhou.github.io/DeepPriorAssembly.

Paper Structure

This paper contains 26 sections, 4 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: An illustration of our work. We assemble diverse deep priors from large models with frozen parameters for scene reconstruction from single images in a zero-shot manner.
  • Figure 2: The overview of deep prior assembly. Given a single image of a 3D scene, we detect the instances and segment them with Grounded-SAM. After normalizing the size and center of the instances, we improve the quality of the instance images by enhancing and inpainting them. Here, we take a sofa in the image as an example. Leveraging the Stable-Diffusion model, we generate a set of candidate images through image-to-image generation, with additional guidance from a text prompt of the instance category predicted by Grounded-SAM. We then filter out the bad generation samples with Open-CLIP by evaluating the cosine similarity between the generated instances and the original one. After that, we generate multiple 3D model proposals for this instance with Shap$\cdot$E from the Top-$K$ generated instance images. Additionally, we estimate the depth of the original input image with Omnidata as a 3D geometry prior. To estimate the layout, we propose an approach to optimize the location, orientation and scale for each 3D proposal by matching it with the estimated segmentation masks and the depths (the $\star$ for the example sofa). Finally, we choose the 3D model proposal with minimal matching error as the final prediction of this instance, and the final scene is generated by combining the generated 3D models for all detected instances.
  • Figure 3: Examples of the effect of our pipeline. For the corrupted 2D instance segmented from the scene image, we leverage Stable-Diffusion to produce $6$ amended generations. We then adopt Open-CLIP to filter out bad samples by judging the similarities and producing confidence scores for the generations, and keep the Top-$3$ generated images. The shapes generated by Shap$\cdot$E from the amended images are significantly more complete and accurate than the one produced from the original corrupted image.
  • Figure 4: Illustration of the depth transform. The estimated depth maps from Omnidata are not scale-aware, resulting in scale inaccuracies and distortion in the back-projected depth point clouds. We obtain an accurate depth point cloud by first transforming the depth maps with the pre-solved scale and shift before back-projecting.
  • Figure 5: Effect of the 2D matching. An example of optimizing the pose and scale for a chair. We visualize the optimization in 2D space. The red 2D points indicate the dense 2D point cloud sampled in the mask, which is the target, and the green 2D points denote the 2D projection of the transformed 3D point cloud sampled from the generated shape of this chair instance. More robust registration is achieved with the proposed 2D matching constraint. The full 1,000 iterations take $9.2$ seconds on a single 3090 GPU.
  • ...and 11 more figures
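
Two steps named in the captions above can be illustrated numerically: the Top-$K$ cosine-similarity filter (Figure 3) and the scale-and-shift depth transform applied before back-projection (Figure 4). The following is a minimal NumPy sketch, not the authors' implementation. It assumes image embeddings have already been computed (no Open-CLIP call is made), and, because the caption does not say how the scale and shift are "pre-solved", it assumes a set of reference metric depths is available and fits them by least squares. The function names and the pinhole intrinsics `fx, fy, cx, cy` are hypothetical.

```python
import numpy as np


def top_k_by_cosine(reference, candidates, k=3):
    """Indices and scores of the k candidates most similar to the reference."""
    ref = reference / np.linalg.norm(reference)
    cand = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = cand @ ref                      # cosine similarity per candidate
    order = np.argsort(-scores)[:k]          # best first
    return order, scores[order]


def solve_scale_shift(relative_depth, metric_depth):
    """Least-squares s, t such that s * relative_depth + t ~= metric_depth.

    Assumption: reference metric depths exist; the source does not state
    how its scale and shift are obtained.
    """
    a = np.stack([relative_depth.ravel(),
                  np.ones(relative_depth.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(a, metric_depth.ravel(), rcond=None)
    return s, t


def back_project(depth, fx, fy, cx, cy):
    """Pinhole back-projection of an (H, W) depth map to (H*W, 3) points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Six generated candidates, keep the three closest to the original crop.
    original = rng.normal(size=512)
    generated = rng.normal(size=(6, 512))
    keep, scores = top_k_by_cosine(original, generated, k=3)
    print("kept candidates:", keep)

    # Align a scale-ambiguous depth map, then back-project it.
    relative = rng.uniform(0.1, 1.0, size=(4, 5))
    metric = 3.0 * relative + 0.5            # synthetic reference depths
    s, t = solve_scale_shift(relative, metric)
    points = back_project(s * relative + t, fx=500.0, fy=500.0, cx=2.0, cy=1.5)
    print("scale, shift:", s, t, "points:", points.shape)
```

Applying the transform before back-projection matters because a shift in depth is not a uniform rescaling of the point cloud: points at different depths are stretched by different amounts, which is the distortion the Figure 4 caption refers to.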