Scene Generation at Absolute Scale: Utilizing Semantic and Geometric Guidance From Text for Accurate and Interpretable 3D Indoor Scene Generation

Stefan Ainetter, Thomas Deixelberger, Edoardo A. Dominici, Philipp Drescher, Konstantinos Vardis, Markus Steinberger

Abstract

We present GuidedSceneGen, a text-to-3D generation framework that produces metrically accurate, globally consistent, and semantically interpretable indoor scenes. Unlike prior text-driven methods that often suffer from geometric drift or scale ambiguity, our approach maintains an absolute world coordinate frame throughout the entire generation process. Starting from a textual scene description, we predict a global 3D layout encoding both semantic and geometric structure, which serves as a guiding proxy for downstream stages. A semantics- and depth-conditioned panoramic diffusion model then synthesizes 360° imagery aligned with the global layout, substantially improving spatial coherence. To explore unobserved regions, we employ a video diffusion model guided by optimized camera trajectories that balance coverage and collision avoidance, achieving up to 10x faster sampling compared to exhaustive path exploration. The generated views are fused using 3D Gaussian Splatting, yielding a consistent and fully navigable 3D scene in absolute scale. GuidedSceneGen enables accurate transfer of object poses and semantic labels from layout to reconstruction, and supports progressive scene expansion without re-alignment. Quantitative results and a user study demonstrate greater 3D consistency and layout plausibility compared to recent panoramic text-to-3D baselines.
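
To make the idea of trading off coverage against collision avoidance concrete, the following is a minimal 2D sketch, not the authors' optimizer. Everything in it is an assumption for illustration: the top-down room and obstacle footprints, the helper names `clearance`, `coverage`, and `score`, and the `margin`, `radius`, and `penalty` values. It scores two hypothetical straight-line camera paths in a metric layout and keeps the better one.

```python
import math

# Hypothetical top-down layout in metres: (xmin, ymin, xmax, ymax).
ROOM = (0.0, 0.0, 4.0, 3.0)
OBSTACLES = [(1.5, 1.0, 2.5, 2.0)]  # e.g. a table footprint


def clearance(p):
    """Signed distance from p to the nearest wall or obstacle (negative = collision)."""
    x, y = p
    xmin, ymin, xmax, ymax = ROOM
    d = min(x - xmin, xmax - x, y - ymin, ymax - y)
    for ox0, oy0, ox1, oy1 in OBSTACLES:
        dx = max(ox0 - x, 0.0, x - ox1)
        dy = max(oy0 - y, 0.0, y - oy1)
        outside = math.hypot(dx, dy)
        if outside > 0.0:
            d = min(d, outside)
        else:  # point lies inside the obstacle: use negative penetration depth
            d = min(d, -min(x - ox0, ox1 - x, y - oy0, oy1 - y))
    return d


def straight_path(start, end, n=20):
    """Sample n camera positions on the segment from start to end."""
    return [(start[0] + (end[0] - start[0]) * i / (n - 1),
             start[1] + (end[1] - start[1]) * i / (n - 1)) for i in range(n)]


def coverage(path, radius=1.0, step=0.25):
    """Fraction of free floor cells within `radius` of any camera position."""
    xmin, ymin, xmax, ymax = ROOM
    free = seen = 0
    for i in range(int((xmax - xmin) / step)):
        for j in range(int((ymax - ymin) / step)):
            c = (xmin + (i + 0.5) * step, ymin + (j + 0.5) * step)
            if clearance(c) <= 0.0:
                continue  # cell is occupied by an obstacle
            free += 1
            if any(math.hypot(c[0] - p[0], c[1] - p[1]) <= radius for p in path):
                seen += 1
    return seen / free


def score(path, margin=0.2, penalty=10.0):
    """Coverage reward minus a penalty for positions closer than `margin` to geometry."""
    violation = sum(max(0.0, margin - clearance(p)) for p in path) / len(path)
    return coverage(path) - penalty * violation


candidates = {
    "through_table": straight_path((0.5, 1.5), (3.5, 1.5)),
    "along_wall": straight_path((0.5, 0.5), (3.5, 0.5)),
}
best = max(candidates, key=lambda name: score(candidates[name]))
print(best)
```

Because the layout is in metres, the safety margin and viewing radius can be stated directly in physical units, which is the practical benefit of keeping an absolute coordinate frame.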

Paper Structure (51 sections, 9 equations, 10 figures, 4 tables, 1 algorithm)

Figures (10)

  • Figure 2: Method Overview. Given a text prompt describing an indoor scene, we first estimate a 3D global scene layout. From this, we derive the guidance signals for panorama generation in the form of depth and semantic renderings. Our guided panoramic image generator uses these renderings to generate a consistent and realistic RGB panoramic image that accurately aligns with the 3D global scene layout. We then derive optimized camera trajectories for novel view synthesis (NVS) to cover unobserved areas, with a focus on avoiding collisions between the camera trajectory and the objects in the scene. Finally, we directly use all generated RGB frames and the corresponding metric-scale camera poses to reconstruct the 3D scene in the absolute world coordinate frame.
  • Figure 3: Visual comparison of 3D reconstruction results for different methods. The 2D renderings show comparable viewpoints for different methods. Our method generates more convincing and complete geometries compared to other methods. We refer to the supplemental for additional visualizations.
  • Figure 4: Scene bounding comparison of 3D reconstruction results for different methods. Our 3DGS point cloud follows the scene geometry accurately, while other methods have significant distortions and floaters. Note that 3D scenes of competitor methods are in relative scale and have been manually re-scaled for better visualization. In contrast, our method provides results in absolute metric scale.
  • Figure 5: Visualization of different NVS trajectories. Left: We utilize MapAnything [keetha2025mapanything] and compare the predicted camera poses for different scale parameters. Using our strategy, the predicted camera poses are closely aligned with the ground truth. Right: Our optimal scale parameter ensures that the estimated camera trajectories avoid collisions with walls and objects, leading to high-quality novel views. In contrast, using default scale parameters can lead to degenerate frames or low scene coverage.
  • Figure 6: Visualization of our annotations. 9D object poses and 2D semantic masks transferred from the 3D proxy align well with novel views of the final 3DGS scene and enable reliable 3D Gaussian segmentation via clustering. Compared to predictions of SegFormer [xie2021segformer], a specialist model trained for 2D indoor scene segmentation, our transferred annotations are more precise.
  • ...and 5 more figures
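
The depth and semantic renderings mentioned in the Figure 2 caption can be illustrated with a small ray-casting sketch. This is an assumption-laden toy, not the paper's renderer: the layout is reduced to axis-aligned boxes in metres, the functions `ray_dir`, `ray_box`, and `render`, the labels, and the resolution are all hypothetical, and depth is stored as Euclidean range along each ray of an equirectangular panorama.

```python
import math


def ray_dir(u, v, width, height):
    """Unit direction for equirectangular pixel (u, v); z points up."""
    lon = (u + 0.5) / width * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v + 0.5) / height * math.pi
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def ray_box(origin, d, box):
    """Slab test against box = (min_xyz, max_xyz); returns (t_enter, t_exit) or None."""
    t0, t1 = -math.inf, math.inf
    for o, di, lo, hi in zip(origin, d, box[0], box[1]):
        if abs(di) < 1e-12:
            if o < lo or o > hi:
                return None  # ray is parallel to this slab and outside it
            continue
        a, b = (lo - o) / di, (hi - o) / di
        if a > b:
            a, b = b, a
        t0, t1 = max(t0, a), min(t1, b)
    if t0 > t1 or t1 < 0.0:
        return None
    return t0, t1


def render(camera, room, objects, width=64, height=32):
    """Return (depth, labels) panoramas as nested lists, depth in metres."""
    depth, labels = [], []
    for v in range(height):
        drow, lrow = [], []
        for u in range(width):
            d = ray_dir(u, v, width, height)
            # The camera is inside the room, so the room surface is hit at t_exit.
            best_t, best_label = ray_box(camera, d, room)[1], "room"
            for label, box in objects:
                hit = ray_box(camera, d, box)
                if hit is not None and 0.0 < hit[0] < best_t:
                    best_t, best_label = hit[0], label
            drow.append(best_t)
            lrow.append(best_label)
        depth.append(drow)
        labels.append(lrow)
    return depth, labels


# Hypothetical 4 m x 3 m x 3 m room with one table-sized box.
room = ((0.0, 0.0, 0.0), (4.0, 3.0, 3.0))
objects = [("table", ((3.0, 1.0, 0.0), (3.5, 2.0, 0.8)))]
camera = (2.0, 1.5, 1.5)
depth, labels = render(camera, room, objects)
print(min(min(r) for r in depth), max(max(r) for r in depth))
```

Both outputs are pixel-aligned, so a depth map and a label map of this kind could in principle be fed together as conditioning images; how the actual model consumes them is described in the paper, not here.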