Diorama: Unleashing Zero-shot Single-view 3D Indoor Scene Modeling

Qirui Wu, Denys Iliash, Daniel Ritchie, Manolis Savva, Angel X. Chang

TL;DR

Diorama presents the first zero-shot, open-world pipeline for holistic CAD-based 3D scene modeling from a single image without end-to-end training. The approach splits the problem into open-world perception and CAD-based scene modeling, integrating open-vocabulary object detection, depth and normal estimation, planar architecture reconstruction, GPT-4o-powered scene graphs, and multimodal CAD retrieval with zero-shot 9-DoF pose estimation, followed by stage-wise layout optimization. The method is evaluated on synthetic SSDB, real ScanNet, and in-the-wild images, showing strong improvements over modular baselines and demonstrating generalization to internet images and text-to-scene tasks. The results indicate practical potential for scalable, editable 3D scene construction in robotics, AR/VR, and simulation, while highlighting avenues for refinement in geometry and texture fidelity.

Abstract

Reconstructing structured 3D scenes from RGB images using CAD objects unlocks efficient and compact scene representations that maintain compositionality and interactability. Existing works propose training-heavy methods relying on either expensive yet inaccurate real-world annotations or controllable yet monotonous synthetic data that do not generalize well to unseen objects or domains. We present Diorama, the first zero-shot open-world system that holistically models 3D scenes from single-view RGB observations without requiring end-to-end training or human annotations. We show the feasibility of our approach by decomposing the problem into subtasks and introduce robust, generalizable solutions to each: architecture reconstruction, 3D shape retrieval, object pose estimation, and scene layout optimization. We evaluate our system on both synthetic and real-world data to show we significantly outperform baselines from prior work. We also demonstrate generalization to internet images and the text-to-scene task.
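
To make the "object pose estimation" subtask concrete, the following is a minimal sketch of what a 9-DoF object pose encodes: three translation, three rotation, and three scale parameters that place a CAD model in the scene. The function names, the Euler-angle convention (R = Rz * Ry * Rx), and the composition order (scale, then rotate, then translate) are assumptions made for illustration; the parameterization Diorama actually uses is not specified in this summary.

```python
import math

def pose_9dof_matrix(translation, euler_xyz, scale):
    """Build a 4x4 transform M = T * R * S from 9 pose parameters.

    Assumed convention (illustrative only): angles in radians,
    R = Rz(c) * Ry(b) * Rx(a), applied after per-axis scaling.
    """
    tx, ty, tz = translation
    a, b, c = euler_xyz
    kx, ky, kz = scale
    ca, sa = math.cos(a), math.sin(a)
    cb, sb = math.cos(b), math.sin(b)
    cc, sc = math.cos(c), math.sin(c)
    # Rotation matrix for R = Rz(c) * Ry(b) * Rx(a).
    rot = [
        [cc * cb, cc * sb * sa - sc * ca, cc * sb * ca + sc * sa],
        [sc * cb, sc * sb * sa + cc * ca, sc * sb * ca - cc * sa],
        [-sb, cb * sa, cb * ca],
    ]
    k = (kx, ky, kz)
    t = (tx, ty, tz)
    # Column j of the upper-left 3x3 block is the rotation column times scale k[j].
    rows = [[rot[i][j] * k[j] for j in range(3)] + [t[i]] for i in range(3)]
    rows.append([0.0, 0.0, 0.0, 1.0])
    return rows

def apply_pose(matrix, point):
    """Map a 3D point from CAD-model space into scene space."""
    x, y, z = point
    return tuple(
        matrix[i][0] * x + matrix[i][1] * y + matrix[i][2] * z + matrix[i][3]
        for i in range(3)
    )

if __name__ == "__main__":
    # Stretch along x by 2, rotate 90 degrees about z, then move to (1, 2, 3).
    M = pose_9dof_matrix((1.0, 2.0, 3.0), (0.0, 0.0, math.pi / 2), (2.0, 1.0, 1.0))
    print(apply_pose(M, (1.0, 0.0, 0.0)))
```

Because scale is anisotropic, such a pose can stretch a retrieved CAD model to match an observed object's proportions, which a 6-DoF (rotation plus translation) pose cannot express.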

Paper Structure

This paper contains 27 sections, 3 equations, 18 figures, 10 tables.

Figures (18)

  • Figure 1: We propose Diorama: a system for zero-shot single-view 3D scene modeling. Our system produces a holistic 3D scene model given a single input image, representing both architectural elements and objects in cluttered indoor scenes. We showcase three examples of our system where the input image is shown in the top left in each case. Other images are renderings of the output 3D scene from different camera viewpoints. (a) 3D scene model from a synthetic image input. (b) 3D scene model from a real-world internet image input. (c) Text-to-scene pipeline where the input image on the left is generated from a text prompt.
  • Figure 2: Illustration of the Diorama pipeline. The input image is processed in the open-world perception component (orange box) through object instance segmentation, depth and normal estimation, architecture reconstruction, and LLM-powered scene graph generation. The CAD-based scene modeling component (green box) then assembles a compositional 3D scene by retrieving and posing objects from a database and optimizing the overall scene layout. Multiple plausible scene arrangement hypotheses are produced as outputs.
  • Figure 3: Our PlainRecon architecture reconstruction approach. Objects are first segmented. The object masks are inpainted, and depth and normals are estimated. Then, normal-based clustering on the point clouds is used to produce the 3D architecture.
  • Figure 4: Our zero-shot object pose estimation approach. We leverage vision transformer features to establish 2D and 3D correspondences and estimate 9-DoF poses for each object.
  • Figure 5: Our semantic-aware scene layout optimization. In this example a stack of books is placed on a side table.
  • ...and 13 more figures
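
The Figure 3 caption mentions normal-based clustering on point clouds. As a rough illustration of that idea, the sketch below groups per-point surface normals by direction, so that points on the same wall or floor tend to share a label. The greedy strategy, the function name `cluster_by_normal`, and the 10-degree threshold are assumptions chosen for brevity; the caption does not say which clustering algorithm or threshold the authors use.

```python
import math

def cluster_by_normal(normals, max_angle_deg=10.0):
    """Greedily group normals by direction.

    Each normal joins the first cluster whose representative (its first
    member) lies within max_angle_deg; otherwise it starts a new cluster.
    Assumes every input normal is non-zero.
    """
    cos_thresh = math.cos(math.radians(max_angle_deg))
    representatives = []  # one unit normal per cluster
    labels = []
    for n in normals:
        length = math.sqrt(sum(c * c for c in n))
        unit = tuple(c / length for c in n)
        for idx, rep in enumerate(representatives):
            # Dot product of unit vectors is the cosine of the angle between them.
            if sum(u * r for u, r in zip(unit, rep)) >= cos_thresh:
                labels.append(idx)
                break
        else:
            representatives.append(unit)
            labels.append(len(representatives) - 1)
    return labels

if __name__ == "__main__":
    # Hypothetical normals: two near the floor direction, two near one wall,
    # and one on a second wall.
    normals = [(0, 0, 1), (0, 0.05, 1), (1, 0, 0), (0.99, 0.02, 0), (0, 1, 0)]
    print(cluster_by_normal(normals))
```

Grouping by direction alone does not separate parallel planes such as opposite walls; a fuller pipeline would also need to split each group by plane offset before fitting planar architecture.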