Diorama: Unleashing Zero-shot Single-view 3D Indoor Scene Modeling

Qirui Wu; Denys Iliash; Daniel Ritchie; Manolis Savva; Angel X. Chang

TL;DR

Diorama presents the first zero-shot, open-world pipeline for holistic CAD-based 3D scene modeling from a single image, without end-to-end training. The approach splits the problem into open-world perception and CAD-based scene modeling. It integrates open-vocabulary object detection, depth and normal estimation, planar architecture reconstruction, GPT-4o-powered scene graphs, and multimodal CAD retrieval with zero-shot 9-DoF pose estimation, followed by stage-wise layout optimization. The method is evaluated on synthetic SSDB, real ScanNet, and in-the-wild images, showing strong improvements over modular baselines and demonstrating generalization to internet images and text-to-scene tasks. The results indicate practical potential for scalable, editable 3D scene construction in robotics, AR/VR, and simulation, while highlighting avenues for refinement in geometry and texture fidelity.
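To make the 9-DoF pose mentioned above concrete, here is a minimal sketch of how such a pose is commonly assembled in CAD alignment work: three translation, three rotation, and three scale parameters composed into one 4x4 transform. This is an illustration under stated assumptions, not Diorama's implementation: the function names `pose_9dof_matrix` and `apply_pose`, the Z-Y-X Euler angle convention, and the composition order (scale, then rotate, then translate) are all hypothetical choices made for this example.

```python
import math


def pose_9dof_matrix(translation, euler_xyz, scale):
    """Build a 4x4 row-major transform M = T * R * S from 9 parameters.

    Assumptions (not from the paper): angles are in radians and the
    rotation is R = Rz * Ry * Rx.
    """
    tx, ty, tz = translation
    rx, ry, rz = euler_xyz
    cos_x, sin_x = math.cos(rx), math.sin(rx)
    cos_y, sin_y = math.cos(ry), math.sin(ry)
    cos_z, sin_z = math.cos(rz), math.sin(rz)

    # Rotation matrix for R = Rz * Ry * Rx.
    rot = [
        [cos_z * cos_y,
         cos_z * sin_y * sin_x - sin_z * cos_x,
         cos_z * sin_y * cos_x + sin_z * sin_x],
        [sin_z * cos_y,
         sin_z * sin_y * sin_x + cos_z * cos_x,
         sin_z * sin_y * cos_x - cos_z * sin_x],
        [-sin_y, cos_y * sin_x, cos_y * cos_x],
    ]

    # Multiplying R by diag(scale) scales column j of R by scale[j].
    matrix = [
        [rot[i][j] * scale[j] for j in range(3)] + [t]
        for i, t in enumerate((tx, ty, tz))
    ]
    matrix.append([0.0, 0.0, 0.0, 1.0])
    return matrix


def apply_pose(matrix, point):
    """Map a 3D point from CAD model space into scene space."""
    x, y, z = point
    return tuple(
        row[0] * x + row[1] * y + row[2] * z + row[3]
        for row in matrix[:3]
    )


# Example: stretch a model 2x along its x axis, turn it 90 degrees
# about the z axis, then move it to (1, 2, 3).
pose = pose_9dof_matrix(
    translation=(1.0, 2.0, 3.0),
    euler_xyz=(0.0, 0.0, math.pi / 2),
    scale=(2.0, 1.0, 1.0),
)
corner = apply_pose(pose, (1.0, 0.0, 0.0))
```

Allowing independent per-axis scale is what separates a 9-DoF pose from a 7-DoF similarity transform, and it lets a retrieved CAD model be stretched to fit an observed object whose proportions differ from the model's.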

Abstract

Reconstructing structured 3D scenes from RGB images using CAD objects unlocks efficient and compact scene representations that maintain compositionality and interactability. Existing works propose training-heavy methods relying on either expensive yet inaccurate real-world annotations or controllable yet monotonous synthetic data that do not generalize well to unseen objects or domains. We present Diorama, the first zero-shot open-world system that holistically models 3D scenes from single-view RGB observations without requiring end-to-end training or human annotations. We show the feasibility of our approach by decomposing the problem into subtasks and introduce robust, generalizable solutions to each: architecture reconstruction, 3D shape retrieval, object pose estimation, and scene layout optimization. We evaluate our system on both synthetic and real-world data to show we significantly outperform baselines from prior work. We also demonstrate generalization to internet images and the text-to-scene task.
