TRELLISWorld: Training-Free World Generation from Object Generators

Hanke Chen, Yuan Liu, Minchen Li

TL;DR

TRELLISWorld presents a training-free method for text-driven 3D scene generation by repurposing object-level diffusion models as modular tiles. A multi-tile denoising framework with overlapping regions and cosine-based blending enables scalable, coherent world synthesis without scene-level retraining. The approach leverages a two-stage TRELLIS-based pipeline in latent space and demonstrates favorable perceptual alignment, reduced seams, and substantial computational efficiency compared to autoregressive baselines. It supports flexible editing, area-specific prompting, and 3D tiling, offering a simple yet powerful foundation for general-purpose language-guided 3D scene construction. Limitations include dependence on base models and lack of post-generation object disentanglement, suggesting directions for future improvement and extension to broader scene types.

Abstract

Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation. However, existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability. In this work, we present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators. We reformulate scene generation as a multi-tile denoising problem, where overlapping 3D regions are independently generated and seamlessly blended via weighted averaging. This enables scalable synthesis of large, coherent scenes while preserving local semantic control. Our method eliminates the need for scene-level datasets or retraining, relies on minimal heuristics, and inherits the generalization capabilities of object-level priors. We demonstrate that our approach supports diverse scene layouts, efficient generation, and flexible editing, establishing a simple yet powerful foundation for general-purpose, language-driven 3D scene construction.
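
The blending step described above can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation: the paper's exact weight formula is not reproduced in this summary, so the raised-cosine window below is an assumed shape, the scene is a flat list rather than a 3D latent grid, and `denoise_tile`, `cosine_weights`, `tile_starts`, and `blended_step` are hypothetical names introduced only for illustration.

```python
import math


def cosine_weights(tile_size):
    """Assumed raised-cosine window: strictly positive, largest at the tile centre."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * (i + 0.5) / tile_size)
            for i in range(tile_size)]


def tile_starts(scene_len, tile_size, stride):
    """Start offsets of overlapping tiles that together cover the whole scene."""
    starts = list(range(0, scene_len - tile_size + 1, stride))
    if starts[-1] != scene_len - tile_size:
        starts.append(scene_len - tile_size)  # extra tile so the tail is covered
    return starts


def blended_step(scene, tile_size, stride, denoise_tile):
    """One tiled step: process each tile independently, then take a weighted average."""
    weights = cosine_weights(tile_size)
    acc = [0.0] * len(scene)   # weighted sum of tile outputs per position
    norm = [0.0] * len(scene)  # sum of weights per position
    for start in tile_starts(len(scene), tile_size, stride):
        # denoise_tile is a stand-in for the object-level model (hypothetical).
        out = denoise_tile(scene[start:start + tile_size])
        for i, (w, value) in enumerate(zip(weights, out)):
            acc[start + i] += w * value
            norm[start + i] += w
    return [a / n for a, n in zip(acc, norm)]


if __name__ == "__main__":
    scene = [float(i) for i in range(10)]
    # Toy "denoiser" that shrinks values toward zero; a real model would go here.
    result = blended_step(scene, tile_size=4, stride=2,
                          denoise_tile=lambda tile: [0.9 * v for v in tile])
    print(result)
```

Because each position is divided by the sum of the weights that touched it, positions covered by a single tile are left as that tile produced them, while positions in an overlap are dominated by whichever tile has them closer to its centre. That is the property that suppresses visible seams at tile borders.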

Paper Structure

This paper contains 41 sections, 2 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: Scenes generated by our framework, TRELLISWorld, using only natural language input. Users may provide fine-grained prompts for specific regions, enabling semantically consistent gradual transitions: e.g., from a dense commercial district with greens, into low-density residential zones.
  • Figure 2: Illustration of our tiled diffusion process. We first split the scene noise into multiple tiles to denoise each tile in parallel. Then we take the weighted average described in Eq. \ref{eq:weights} for each tile and aggregate the result to obtain the scene noise for previous timesteps. This process is detailed in Eq. \ref{eq:tiled}.
  • Figure 3: Top-down views of a generated 4x3x1 scene (not cherry-picked) using (a) an autoregressive method based on inpainting and (b) our method. Our method consistently shows better blending between tiles across different themes.
  • Figure 4: Comparison example with 3x2x1 city chunks for (a) decoding using our tiled decoding method and (b) decoding the entire generation at once. We observe severe artifacts when decoding without our tiled decoder.
  • Figure 5: Comparison (not cherry-picked) showing the effectiveness of blending. (a) With plain average blending, the "room" example tends to generate walls around tile borders, and the "lego tile" example produces an undesirable colored edge along tile borders. (b) With our weighted blending, tile borders become less noticeable.
  • ...and 14 more figures