
LT3SD: Latent Trees for 3D Scene Diffusion

Quan Meng, Lei Li, Matthias Nießner, Angela Dai

TL;DR

LT3SD tackles the generation of large-scale, coherent 3D scenes with high fidelity. It introduces a latent tree representation that, at each resolution level, separates lower-frequency geometry (a truncated unsigned distance field, TUDF) from higher-frequency detail (a latent feature grid), and it trains patch-based diffusion models on these levels to synthesize scenes coarse-to-fine. The approach supports arbitrary-sized ("infinite") scene generation and probabilistic completion of partial scenes. Experiments on 3D-FRONT show improvements over baselines in both quality and diversity, and a nearest-neighbor analysis indicates that generated scene patches differ from their closest training patches. The target applications are scalable 3D content creation for games and simulations.

Abstract

We present LT3SD, a novel latent diffusion model for large-scale 3D scene generation. Recent advances in diffusion models have shown impressive results in 3D object generation, but are limited in spatial extent and quality when extended to 3D scenes. To generate complex and diverse 3D scene structures, we introduce a latent tree representation to effectively encode both lower-frequency geometry and higher-frequency detail in a coarse-to-fine hierarchy. We can then learn a generative diffusion process in this latent 3D scene space, modeling the latent components of a scene at each resolution level. To synthesize large-scale scenes with varying sizes, we train our diffusion model on scene patches and synthesize arbitrary-sized output 3D scenes through shared diffusion generation across multiple scene patches. Through extensive experiments, we demonstrate the efficacy and benefits of LT3SD for large-scale, high-quality unconditional 3D scene generation and for probabilistic completion for partial scene observations.
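The coarse-to-fine hierarchy described above can be illustrated with a toy example. The following is a minimal sketch, not the authors' method: it replaces the learned encoder and decoder with a fixed averaging/difference split on a 1D grid, purely to show how a fine grid can be factorized into a coarser grid plus a detail grid at each level and then rebuilt from the coarsest level upward. All function names (`encode`, `decode`, `build_tree`, `reconstruct`) are hypothetical.

```python
# Toy stand-in for a latent tree: each level is split into a coarser grid
# (lower-frequency content) and a detail grid (higher-frequency content).
# ASSUMPTION: the fixed averaging/difference split below replaces the paper's
# learned encoder/decoder and works on 1D lists instead of 3D grids.

def encode(fine):
    """Split a fine grid (even length) into a coarse grid and a detail grid."""
    pairs = [(fine[2 * k], fine[2 * k + 1]) for k in range(len(fine) // 2)]
    coarse = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return coarse, detail

def decode(coarse, detail):
    """Rebuild the fine grid from a coarse grid and its detail grid."""
    fine = []
    for c, d in zip(coarse, detail):
        fine.extend([c + d, c - d])
    return fine

def build_tree(finest, num_levels):
    """Return the coarsest grid and the detail grids, ordered coarsest first."""
    details = []
    current = finest
    for _ in range(num_levels - 1):
        current, detail = encode(current)
        details.append(detail)
    details.reverse()
    return current, details

def reconstruct(root, details):
    """Decode level by level, from the coarsest grid to the finest."""
    current = root
    for detail in details:
        current = decode(current, detail)
    return current

finest = [0.0, 1.0, 3.0, 2.0, 4.0, 4.0, 1.0, 0.0]
root, details = build_tree(finest, num_levels=3)
restored = reconstruct(root, details)
```

In this sketch the detail grids are computed from the data. In a generative setting such as the one the abstract describes, the detail at each level would instead be produced by a model conditioned on the coarser grid, and the same level-by-level decoding loop would apply.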

Paper Structure

This paper contains 25 sections, 8 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: We introduce LT3SD, a novel latent 3D scene diffusion approach enabling high-fidelity generation of infinite 3D environments. We train LT3SD on a latent tree-based 3D scene representation, encoding both lower-frequency geometry and higher-frequency detail, and synthesize infinite scenes in a patch-by-patch and coarse-to-fine fashion.
  • Figure 2: Latent representations for 3D scenes. In contrast to encoding a 3D scene into a single latent feature grid or a multi-level latent pyramid on the left, our latent tree representation on the right is a learned hierarchical decomposition with a series of geometry (lower-frequency) and latent feature (higher-frequency) encodings at each resolution level.
  • Figure 3: Overview of LT3SD. We formulate 3D scene generation as a patch-based latent diffusion process. Left: To characterize complex scene geometry, we encode 3D scenes in a novel latent tree representation, where each scene resolution level $i \in [1, N-1]$ is decomposed into a TUDF grid $L_{i}^s$ and a latent feature grid $H_{i}^s$. Top Right: During latent tree training, the encoder $\mathcal{E}$ encodes a patch $L_{i+1}$ from the scene grid $L_{i+1}^s$ at resolution level $i+1$ to a coarser TUDF patch $L_i$ and a latent feature patch $H_i$ at level $i$. The decoder $\mathcal{D}$ then reconstructs the scene patch $L_{i+1}$ based on the factorized grids $L_i$ and $H_i$. Bottom Right: During generation, the diffusion model $\mathcal{G}$ learns to generate a latent feature patch $H_i$ conditioned on a TUDF patch $L_i$ within the same level $i$. Our method enables arbitrary-sized 3D scene generation at inference time by synthesizing scenes in a coarse-to-fine hierarchy and a patch-by-patch fashion.
  • Figure 4: Qualitative comparison. We compare unconditional 3D scene generation with diverse 3D diffusion methods: PVD [Zhou_2021_ICCV], NFD [shue20233d], BlockFusion [wu2024blockfusion], and XCube [ren2023xcube]. All methods were trained on houses from the 3D-FRONT dataset [fu20213dfront]. Our latent tree-based 3D scene diffusion approach synthesizes cleaner surfaces with more geometric details and captures diverse furniture objects.
  • Figure 5: Generation novelty analysis. Our generated scene patches (left), compared with their nearest-neighbor retrieved training patches by Chamfer distance. Our approach can synthesize novel patches with different geometric structures than their training set neighbors.
  • ...and 6 more figures
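
The novelty analysis in Figure 5 retrieves, for each generated patch, its nearest training patch by Chamfer distance. The following is a minimal sketch of such a retrieval, assuming patches are given as small lists of 3D points and assuming one common convention for the metric (mean squared nearest-neighbor distance, summed over both directions). The caption does not state which variant or which surface sampling is used, and the function names are hypothetical.

```python
# Brute-force nearest-neighbor retrieval by Chamfer distance.
# ASSUMPTIONS: patches are lists of (x, y, z) tuples; the metric is the
# symmetric, squared, mean-reduced Chamfer variant. Other variants exist.

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def one_way_chamfer(source, target):
    """Mean squared distance from each source point to its closest target point."""
    return sum(min(squared_distance(p, q) for q in target) for p in source) / len(source)

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets."""
    return one_way_chamfer(points_a, points_b) + one_way_chamfer(points_b, points_a)

def nearest_training_patch(generated, training_patches):
    """Return (index, distance) of the training patch closest to `generated`."""
    distances = [chamfer_distance(generated, patch) for patch in training_patches]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]

generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
training_patches = [
    [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
]
index, distance = nearest_training_patch(generated, training_patches)
```

A large distance to the retrieved neighbor suggests the generated patch is not a copy of a training patch. The brute-force search is quadratic in the number of points, so a real evaluation would likely use a spatial index or a batched GPU implementation.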