DreamAnywhere: Object-Centric Panoramic 3D Scene Generation

Edoardo Alberto Dominici, Jozef Hladky, Floor Verhoeven, Lukas Radl, Thomas Deixelberger, Stefan Ainetter, Philipp Drescher, Stefan Hauswiesner, Arno Coomans, Giacomo Nazzaro, Konstantinos Vardis, Markus Steinberger

TL;DR

DreamAnywhere tackles text-to-3D scene generation by using a 360° panorama as a global prior and decomposing the scene into background and object components, which are then fused into a navigable 3D Gaussian splatting (3DGS) representation. The method combines a perspective-conditioned panorama diffusion model, robust object reconstruction via multimodal resynthesis and NeRF-to-3DGS conversion, and a hybrid 2D/3D inpainting strategy to ensure multi-view coherence. Quantitative results and a user study show superior novel-view coherence and competitive image quality compared to state-of-the-art baselines, with a clear user preference for DreamAnywhere. The modular design lets individual components be replaced independently, enabling rapid prototyping and practical use in low-budget production workflows while preserving interactive object editing and scene exploration.

Abstract

Recent advances in text-to-3D scene generation have demonstrated significant potential to transform content creation across multiple industries. Although the research community has made impressive progress in addressing the challenges of this complex task, existing methods often generate environments that are only front-facing, lack visual fidelity, exhibit limited scene understanding, and are typically fine-tuned for either indoor or outdoor settings. In this work, we address these issues and propose DreamAnywhere, a modular system for the fast generation and prototyping of 3D scenes. Our system synthesizes a 360° panoramic image from text, decomposes it into background and objects, constructs a complete 3D representation through hybrid inpainting, and lifts object masks to detailed 3D objects that are placed in the virtual environment. DreamAnywhere supports immersive navigation and intuitive object-level editing, making it ideal for scene exploration, visual mock-ups, and rapid prototyping, all with minimal manual modeling. These features make our system particularly suitable for low-budget movie production, enabling quick iteration on scene layout and visual tone without the overhead of traditional 3D workflows. Our modular pipeline is highly customizable as it allows components to be replaced independently. Compared to current state-of-the-art text- and image-based 3D scene generation approaches, DreamAnywhere shows significant improvements in coherence in novel view synthesis and achieves competitive image quality, demonstrating its effectiveness across diverse and challenging scenarios. A comprehensive user study demonstrates a clear preference for our method over existing approaches, validating both its technical robustness and practical usefulness.

Paper Structure

This paper contains 24 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Starting from a text prompt, our method generates 3D scenes leveraging a 360° panoramic image as an intermediate representation to extract, reconstruct and compose objects in the environment. Our scenes, represented as 3D Gaussian splats, support long-range exploration and have high structural coherence even under large camera offsets. This improves immersive scene generation for existing applications, enables intuitive object-level editing and makes them strong 3D priors for world-to-world transfer models [cosmos].
  • Figure 2: An overview of our architecture. We start by generating a 360° panorama, followed by instance segmentation to separate the foreground objects from the scene background. Segmented objects are lifted to 3D using semantic and geometric priors, while a 3D inpainting process converts the image to a 3DGS representation and fills disoccluded regions. Finally, the scene is composited, enabling navigation of the 3D environment at interactive rates.
  • Figure 3: Our image generation pipeline uses a diffusion model to denoise an equirectangular image [yang2024layerpano3d, feng2023diffusion360], but additionally combines perspective image conditioning through a decoupled cross-attention layer [ye2023ip]. We train an equirectangular LoRA jointly with the perspective pre-trained cross-attention layer by encoding random perspective crops of the panorama, which shows stronger generalization to out-of-domain panorama sampling.
  • Figure 4: Our object reconstruction pipeline leverages the panorama and style information to generate a high-resolution reference image to be used for multi-view generation. The generated multi-view images are then transformed into 3D Gaussian splats through a reconstruction pipeline. Finally, we align the generated object with the original and place it in the scene.
  • Figure 5: Qualitative evaluation of our object reconstruction module. Employing multimodal cues faithfully captures the stylistic and structural essence of low-fidelity or incomplete objects, reconstructing a suitable 3D counterpart to be later placed in the environment.
  • ...and 6 more figures
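
The caption of Figure 3 mentions encoding random perspective crops of the equirectangular panorama. As a rough illustration of what such a crop involves geometrically, the sketch below samples a pinhole-camera view from an equirectangular image with NumPy. It is not the authors' implementation: the function names, the nearest-neighbour sampling, the axis conventions (+z forward at the panorama centre, +y up) and the yaw, pitch and field-of-view ranges are all assumptions made for this example.

```python
import numpy as np


def perspective_crop(pano, yaw_deg, pitch_deg, fov_deg, out_hw):
    """Sample a pinhole view from an equirectangular panorama (nearest neighbour).

    Hypothetical helper: +z is forward (panorama centre), +y is up,
    positive yaw turns right, positive pitch looks up.
    """
    pano_h, pano_w = pano.shape[:2]
    out_h, out_w = out_hw

    # Focal length in pixels, derived from the horizontal field of view.
    focal = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2.0)

    # Pixel-centre coordinates relative to the principal point.
    xs = np.arange(out_w) + 0.5 - out_w / 2.0
    ys = np.arange(out_h) + 0.5 - out_h / 2.0
    x, y = np.meshgrid(xs, -ys)  # negate so that +y points up
    z = np.full_like(x, focal)

    # Rotate the camera rays: pitch about the x axis, then yaw about the y axis.
    pitch, yaw = np.radians(pitch_deg), np.radians(yaw_deg)
    y, z = y * np.cos(pitch) + z * np.sin(pitch), -y * np.sin(pitch) + z * np.cos(pitch)
    x, z = x * np.cos(yaw) + z * np.sin(yaw), -x * np.sin(yaw) + z * np.cos(yaw)

    # Ray direction -> longitude/latitude -> panorama pixel coordinates.
    lon = np.arctan2(x, z)               # in [-pi, pi]
    lat = np.arctan2(y, np.hypot(x, z))  # in [-pi/2, pi/2]
    u = (lon / (2.0 * np.pi) + 0.5) * pano_w
    v = (0.5 - lat / np.pi) * pano_h

    cols = np.floor(u).astype(int) % pano_w  # wrap around horizontally
    rows = np.clip(np.floor(v).astype(int), 0, pano_h - 1)
    return pano[rows, cols]


def random_perspective_crop(pano, rng, out_hw=(64, 64)):
    """Draw one crop with a random viewing direction (ranges are assumed)."""
    yaw = rng.uniform(-180.0, 180.0)
    pitch = rng.uniform(-30.0, 30.0)
    fov = rng.uniform(60.0, 100.0)
    return perspective_crop(pano, yaw, pitch, fov, out_hw)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pano = rng.random((256, 512, 3))  # stand-in for a generated panorama
    crop = random_perspective_crop(pano, rng)
    print(crop.shape)
```

Nearest-neighbour lookup keeps the example short; a training pipeline would more likely use bilinear interpolation and a batched GPU implementation.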