SynCity: Training-Free Generation of 3D Worlds

Paul Engstler, Aleksandar Shtedritski, Iro Laina, Christian Rupprecht, Andrea Vedaldi

TL;DR

SynCity is a training-free pipeline that generates large, navigable 3D worlds by building a grid of tiles autoregressively. It combines language prompting, a 2D image generator with isometric framing, and a 3D generator (TRELLIS) to produce individual tiles, then blends them in 2D and 3D latent spaces to form a coherent world. The method includes context-aware prompting, rebasing, geometric validation, and multi-view upsampling; ablations and human studies compare it against prior approaches. Because no foundation model is retrained, the approach can produce scalable, diverse, and detailed 3D environments, with potential applications in gaming, simulation, and virtual reality.

Abstract

We address the challenge of generating 3D worlds from textual descriptions. We propose SynCity, a training- and optimization-free approach, which leverages the geometric precision of pre-trained 3D generative models and the artistic versatility of 2D image generators to create large, high-quality 3D spaces. While most 3D generative models are object-centric and cannot generate large-scale worlds, we show how 3D and 2D generators can be combined to generate ever-expanding scenes. Through a tile-based approach, we allow fine-grained control over the layout and the appearance of scenes. The world is generated tile-by-tile, and each new tile is generated within its world-context and then fused with the scene. SynCity generates compelling and immersive scenes that are rich in detail and diversity.
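
The tile-by-tile procedure described in the abstract can be illustrated with a minimal sketch. It assumes a row-major generation order and 4-connected neighbours as the "world-context"; neither detail is stated in the abstract. The names `build_world`, `generated_neighbours`, and `stub_generate_tile` are hypothetical, and the stub stands in for the actual 2D and 3D generators.

```python
from typing import Callable, Dict, List, Tuple

Coord = Tuple[int, int]


def generated_neighbours(world: Dict[Coord, str], x: int, y: int) -> List[Coord]:
    """Return the already generated 4-connected neighbours of tile (x, y)."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [c for c in candidates if c in world]


def build_world(
    width: int,
    height: int,
    generate_tile: Callable[[Coord, List[Coord]], str],
) -> Dict[Coord, str]:
    """Fill a width x height grid one tile at a time.

    Assumption: tiles are visited in row-major order. Each new tile is
    generated with the coordinates of its existing neighbours as context
    and is then added to the world, so later tiles can depend on it.
    """
    world: Dict[Coord, str] = {}
    for y in range(height):
        for x in range(width):
            context = generated_neighbours(world, x, y)
            world[(x, y)] = generate_tile((x, y), context)
    return world


def stub_generate_tile(coord: Coord, context: List[Coord]) -> str:
    """Hypothetical stand-in for the real generators: records the context used."""
    return f"tile{coord} <- {sorted(context)}"


if __name__ == "__main__":
    for coord, tile in build_world(2, 2, stub_generate_tile).items():
        print(coord, tile)
```

In this sketch the first tile is generated without context, while every later tile sees whichever neighbours already exist. This is the sense in which the construction is autoregressive.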

Paper Structure

This paper contains 45 sections, 3 equations, 20 figures, 3 tables.

Figures (20)

  • Figure 1: We introduce SynCity, a novel method that can generate complex and immersive 3D worlds from a prompt; the worlds can be navigated freely. Our method is training-free and leverages powerful language, 2D and 3D generators via novel prompt engineering strategies.
  • Figure 2: Overview of SynCity. 2D prompting: To generate a new tile, we first render a view of where that tile should be placed, including context from neighbouring tiles. 3D prompting: We extract the new tile image and construct an image prompt for TRELLIS by adding a wider base under the tile. 3D blending: The 3D model that TRELLIS outputs is usually not well blended with the rest of the scene. To address that, we render a view of the new tile next to each neighbouring tile, and inpaint the region between the two with an image inpainting model. Next, we condition using that well-blended view to refine the region between the two 3D tiles. Finally, the new, blended, tile is added to the world.
  • Figure 3: Left: Progressive generation of world tiles $\mathcal{T}$. Right: Isometric framing of a tile for image-based prompting.
  • Figure 4: Left: Generation of the 2D image prompt for the first world tile at $x=0$ and $y=0$. The image generator $\Phi_\text{2D}$ is conditioned on $q = p_{00} \cdot p_\star$ and tasked with inpainting the base image $B$ in the masked region $M$. Right: If we do not 'frame' the image by using $B$ and $M$, the generator produces an image which is not suitable for tiling.
  • Figure 5: Left: Base image $B$ and inpainting mask $M$ (white overlay) used to prompt the image generator $\Phi_\text{2D}$ to generate an image for a world tile with $x > 0$, $y > 0$. Right: Result of inpainting.
  • ...and 15 more figures
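
The conditioning described in the captions of Figures 4 and 5 (a prompt $q = p_{00} \cdot p_\star$, a base image $B$, and a mask $M$) can be illustrated with a small sketch. It makes two assumptions the captions do not spell out: that "$\cdot$" denotes string concatenation, and that $p_\star$ is a prompt shared across tiles. The compositing step shows only the general inpainting convention of keeping $B$ outside $M$; it is not the generator $\Phi_\text{2D}$ itself. All function names are hypothetical.

```python
from typing import List

Image = List[List[int]]   # toy grayscale image as rows of pixel values
Mask = List[List[bool]]   # True where the generator is allowed to paint


def compose_prompt(tile_prompt: str, shared_prompt: str) -> str:
    """Assumed reading of q = p_xy . p_star: concatenate the two prompts."""
    return f"{tile_prompt} {shared_prompt}".strip()


def composite(base: Image, generated: Image, mask: Mask) -> Image:
    """Keep the base image outside the mask and generated pixels inside it."""
    return [
        [g if m else b for b, g, m in zip(base_row, gen_row, mask_row)]
        for base_row, gen_row, mask_row in zip(base, generated, mask)
    ]


if __name__ == "__main__":
    q = compose_prompt("a medieval market", "isometric view")
    base = [[0, 0], [0, 0]]
    generated = [[9, 9], [9, 9]]
    mask = [[True, False], [False, True]]
    print(q)
    print(composite(base, generated, mask))
```

Framing the generation this way constrains where new content may appear, which is the role the Figure 4 caption attributes to $B$ and $M$: without them, the generated image is not suitable for tiling.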