TRELLISWorld: Training-Free World Generation from Object Generators
Hanke Chen, Yuan Liu, Minchen Li
TL;DR
TRELLISWorld presents a training-free method for text-driven 3D scene generation that repurposes object-level diffusion models as modular tile generators. A multi-tile denoising framework with overlapping regions and cosine-based blending enables scalable, coherent world synthesis without scene-level retraining. The approach uses a two-stage TRELLIS-based pipeline in latent space and, compared to autoregressive baselines, shows favorable perceptual alignment, fewer visible seams, and substantially lower computational cost. It supports flexible editing, area-specific prompting, and 3D tiling, offering a simple yet powerful foundation for general-purpose language-guided 3D scene construction. Limitations include dependence on the base models and the lack of post-generation object disentanglement, which suggest directions for future improvement and extension to broader scene types.
Abstract
Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation. However, existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability. In this work, we present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators. We reformulate scene generation as a multi-tile denoising problem, where overlapping 3D regions are independently generated and seamlessly blended via weighted averaging. This enables scalable synthesis of large, coherent scenes while preserving local semantic control. Our method eliminates the need for scene-level datasets or retraining, relies on minimal heuristics, and inherits the generalization capabilities of object-level priors. We demonstrate that our approach supports diverse scene layouts, efficient generation, and flexible editing, establishing a simple yet powerful foundation for general-purpose, language-driven 3D scene construction.
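To make the blending step concrete, here is a minimal sketch of weighted averaging over overlapping tiles, reduced to one dimension for readability. It is an illustration under stated assumptions, not the authors' implementation: the function names `cosine_window` and `blend_tiles`, the raised-cosine window shape, and the `eps` floor are all hypothetical choices, and the actual method operates on 3D latent tiles rather than 1D arrays.

```python
import numpy as np


def cosine_window(size: int, eps: float = 1e-3) -> np.ndarray:
    """Raised-cosine weights: close to eps at the tile edges, close to 1 at the centre.

    The eps floor (an assumption of this sketch) keeps every weight positive,
    so a cell covered by a single tile still gets a well-defined average.
    """
    x = (np.arange(size) + 0.5) / size  # cell centres in (0, 1)
    return eps + (1.0 - eps) * 0.5 * (1.0 - np.cos(2.0 * np.pi * x))


def blend_tiles(tiles, offsets, length):
    """Weighted average of overlapping 1D tiles placed at the given offsets."""
    acc = np.zeros(length)   # sum of weight * value per cell
    wsum = np.zeros(length)  # sum of weights per cell
    for tile, start in zip(tiles, offsets):
        tile = np.asarray(tile, dtype=float)
        w = cosine_window(len(tile))
        acc[start:start + len(tile)] += w * tile
        wsum[start:start + len(tile)] += w
    # Cells covered by no tile have zero weight and stay at zero.
    return acc / np.maximum(wsum, 1e-12)


# Two tiles of length 8 overlapping by 4 cells on a canvas of length 12.
tile_a = np.zeros(8)
tile_b = np.ones(8)
blended = blend_tiles([tile_a, tile_b], offsets=[0, 4], length=12)
print(blended)
```

In the overlap, each tile's influence fades toward its own boundary, so the blended values move smoothly from one tile to the other instead of jumping at a seam. Where only one tile covers a cell, normalizing by the weight sum returns that tile's value unchanged.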
