GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting

Xiaoyu Zhou, Xingjian Ran, Yajiao Xiong, Jinlin He, Zhiwei Lin, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang

TL;DR

GALA3D tackles the challenge of generating high-fidelity, complex 3D scenes from text by injecting layout priors derived from large language models into a layout-guided Gaussian Splatting framework. It combines adaptive geometry control, a compositional diffusion-based optimization, and a layout refinement module to ensure accurate object interactions and scene-wide consistency. The approach outperforms state-of-the-art methods in both qualitative and quantitative evaluations and supports interactive editing through conversational prompts. The work advances end-to-end, user-friendly text-to-3D scene generation with scalable multi-object capabilities.

Abstract

We present GALA3D, generative 3D GAussians with LAyout-guided control, for effective compositional text-to-3D generation. We first utilize large language models (LLMs) to generate the initial layout and introduce a layout-guided 3D Gaussian representation for 3D content generation with adaptive geometric constraints. We then propose an instance-scene compositional optimization mechanism with conditioned diffusion to collaboratively generate realistic 3D scenes with consistent geometry, texture, scale, and accurate interactions among multiple objects while simultaneously adjusting the coarse layout priors extracted from the LLMs to align with the generated scene. Experiments show that GALA3D is a user-friendly, end-to-end framework for state-of-the-art scene-level 3D content generation and controllable editing while ensuring the high fidelity of object-level entities within the scene. The source codes and models will be available at gala3d.github.io.
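To make the first step concrete, the sketch below shows one way a coarse layout could seed per-object Gaussian centers. The abstract does not specify the layout format or the initialization scheme, so everything here is an assumption for illustration: the `LayoutBox` fields, the uniform sampling inside each box, and the helper names `sample_centers` and `to_local` are hypothetical and are not taken from the GALA3D code.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class LayoutBox:
    """Hypothetical layout entry: one box per object named in the prompt."""
    name: str
    center: tuple      # (x, y, z) in scene coordinates
    size: tuple        # (width, height, depth), full extents
    yaw: float = 0.0   # assumed rotation about the vertical (y) axis, radians


def sample_centers(box, count, seed=0):
    """Draw `count` candidate Gaussian centers uniformly inside the box."""
    rng = random.Random(seed)
    cos_y, sin_y = math.cos(box.yaw), math.sin(box.yaw)
    cx, cy, cz = box.center
    points = []
    for _ in range(count):
        # Local offset within the half extents along each axis.
        lx, ly, lz = (rng.uniform(-s / 2, s / 2) for s in box.size)
        # Rotate about the y axis, then translate to the box center.
        wx = cos_y * lx + sin_y * lz
        wz = -sin_y * lx + cos_y * lz
        points.append((cx + wx, cy + ly, cz + wz))
    return points


def to_local(box, point):
    """Map a scene-space point back into the box's local frame."""
    px, py, pz = (p - c for p, c in zip(point, box.center))
    cos_y, sin_y = math.cos(box.yaw), math.sin(box.yaw)
    return (cos_y * px - sin_y * pz, py, sin_y * px + cos_y * pz)


# Made-up layout of the kind an LLM might return for "a lamp on a table".
layout = [
    LayoutBox("table", (0.0, 0.4, 0.0), (1.2, 0.8, 0.8)),
    LayoutBox("lamp", (0.3, 1.0, 0.1), (0.2, 0.4, 0.2), yaw=math.pi / 4),
]
centers = {box.name: sample_centers(box, 1000) for box in layout}
```

Keeping each object's points tied to its own box is what would let a later stage optimize instances separately and then compose them into one scene. The geometric constraints and layout adjustment the abstract describes would operate on top of such a representation and are not modeled here.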

Paper Structure (16 sections, 14 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: GALA3D generates high-quality complex 3D scenes and supports interactive controllable editing. Existing methods either produce low-quality textures, visual artifacts, and geometric distortions or fail to accurately generate multiple objects and their interactions according to the text.
  • Figure 2: Overview of our method. Given a textual description, GALA3D first creates a coarse layout using LLMs. The layout is then used to construct the Layout-guided Gaussian Representation, which incorporates Adaptive Geometry Control to constrain the Gaussians' geometric shape and spatial distribution. Subsequently, Compositional Diffusions are employed to optimize the 3D Gaussians compositionally using text-to-image priors. Simultaneously, the Layout Refinement module refines the initial layout provided by the LLMs, enabling better adherence to real-world scene constraints.
  • Figure 3: Adaptive Geometry Control for instance Gaussians. Note that the improved Gaussian distribution results in enhanced texture and geometry, as the colors of Gaussians on the surface become more aligned.
  • Figure 4: Layout Refinement. The LLM-generated layouts exhibit spatial misalignment and abnormal scale. We employ Layout Refinement to optimize the layout, yielding a layout that is better aligned with the text and the 3D scene.
  • Figure 5: Qualitative comparisons of text-to-3D generation approaches. Our method is capable of generating high-quality single-object, interactive multi-object, and complex composite scenes that are highly consistent with the textual descriptions.
  • ...and 3 more figures