WorldMesh: Generating Navigable Multi-Room 3D Scenes via Mesh-Conditioned Image Diffusion

Manuel-Andreas Schneider, Angela Dai

Abstract

Recent progress in image and video synthesis has inspired the use of these models for 3D scene generation. However, we observe that text-to-image and text-to-video approaches struggle to maintain scene- and object-level consistency beyond a limited environment scale due to the absence of explicit geometry. We thus present a geometry-first approach that decouples the complex problem of large-scale 3D scene synthesis into structural composition, represented as a mesh scaffold, and realistic appearance synthesis, which leverages powerful image synthesis models conditioned on the mesh scaffold. From an input text description, we first construct a mesh capturing the environment's geometry (walls, floors, etc.), and then use image synthesis, segmentation, and object reconstruction to populate the mesh structure with objects in realistic layouts. This mesh scaffold is then rendered to condition image synthesis, providing a structural backbone for consistent appearance generation. This enables scalable, arbitrarily sized 3D scenes of high object richness and diversity, combining robust 3D consistency with photorealistic detail. We believe this marks a significant step toward generating truly environment-scale, immersive 3D worlds.

Paper Structure (38 sections, 3 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: WorldMesh tackles environment-scale 3D scene synthesis by decoupling this complex problem into structure and appearance. From an input text prompt, we first construct a mesh scaffold that establishes the output scene's layout and geometric structure. This mesh is then used as a scaffold for conditioned image synthesis to encourage multi-view consistency across both local (e.g., around objects) and global (e.g., across rooms) scales. Synthesized views are optimized into 3D Gaussian splats representing a 3D world that can be navigated through view synthesis.
  • Figure 2: Overview of WorldMesh. To generate a complex, multi-room 3D scene from a text prompt, we decompose the problem into first constructing the global scene structure as a mesh scaffold (top), and then using the scaffold mesh as an anchor for realistic local appearance (bottom). The text prompt is used to generate a text-based floor plan, which we construct in 3D to use as depth conditioning for an image synthesis model $\Phi$, in order to reconstruct estimated 3D objects in each room. The structural elements and 3D objects constitute the scaffold mesh $\mathcal{M}$, for which initial wall textures are generated with $\Phi$. $\mathcal{M}$ then serves as a geometric anchor for iterative image synthesis using $\Phi$ to generate images $\{I_i\}$. Finally, the output scene $\mathcal{S}$ is optimized with geometry-regularized 3DGS, against both images $\{I_i\}$ and rendered depth from $\mathcal{M}$.
  • Figure 3: Qualitative comparison with baselines. We compare with state-of-the-art methods for 3D scene generation leveraging image, panorama, and video generative model priors [zhou2024dreamscene360, yang2025layerpano3d, chen2024flexworld, yu2024wonderworld, SpatialGen, schneider2025worldexplorer]. Since these methods focus on a single-room setting, we show our results focusing on one of our generated rooms within our multi-room generation. Most baselines lack explicit 3D structure, leading to view inconsistencies, particularly for challenging viewpoints close to objects (as views are typically synthesized with a wider range for easier scene coverage and consistency). SpatialGen [SpatialGen] mitigates this with 3D box-based layout conditioning, but struggles to maintain realistic local detail. In contrast, our mesh scaffold anchors synthesis geometrically, enabling much stronger multi-view consistency and more realistic appearance across diverse views.
  • Figure 4: WorldMesh generates a diverse range of complex 3D scenes, varying both in spatial layouts and visual themes. We show rendered views from different rooms within each multi-room generated environment, including transitional viewpoints between rooms, demonstrating that stylistic coherence and 3D consistency are maintained throughout.
  • Figure 5: Ablation of mesh scaffold conditioning. Using only the structural mesh without initial textures (depth only) produces large inconsistencies at the local and object level. Using our scaffold mesh with objects but without initial textures (depth+objects) preserves object coherency, but still exhibits inconsistencies on structural elements. Structural conditioning with initial texture, but with objects represented as bounding boxes (depth+walls+bboxes), does not sufficiently constrain 3D consistency. Our full mesh scaffold with initial texture achieves strong local and global 3D consistency.
  • ...and 4 more figures
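
The Figure 2 caption describes optimizing the output scene against both the synthesized images and depth rendered from the scaffold mesh. The excerpt does not give the objective itself, so the following is only a minimal sketch of what such a geometry-regularized loss could look like. The function name, the L1 form of both terms, the validity mask, and the `depth_weight` value are all assumptions made for illustration, not the authors' formulation.

```python
import numpy as np


def geometry_regularized_loss(rendered_rgb, target_rgb,
                              rendered_depth, scaffold_depth,
                              valid_mask, depth_weight=0.1):
    """Hypothetical objective: photometric L1 plus a depth L1 regularizer.

    rendered_rgb, target_rgb:      (H, W, 3) float arrays
    rendered_depth, scaffold_depth: (H, W) float arrays
    valid_mask:                    (H, W) bool array, True where the
                                   scaffold mesh provides a depth value
    depth_weight:                  assumed weighting, not from the paper
    """
    # Appearance term: mean absolute error against the synthesized image.
    photometric = np.abs(rendered_rgb - target_rgb).mean()

    # Geometry term: mean absolute depth error, only where the mesh is visible.
    if valid_mask.any():
        depth_term = np.abs(rendered_depth - scaffold_depth)[valid_mask].mean()
    else:
        depth_term = 0.0

    return photometric + depth_weight * depth_term
```

In a real 3DGS pipeline the arrays would come from a differentiable rasterizer and the loss would be written in an autodiff framework; NumPy is used here only to keep the arithmetic easy to inspect.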