SemCity: Semantic Scene Generation with Triplane Diffusion

Jumin Lee, Sebin Lee, Changho Jo, Woobin Im, Juhyeong Seon, Sung-Eui Yoon

TL;DR

SemCity addresses the challenge of generating realistic semantic outdoor scenes by learning a diffusion model on a triplane representation, which factorizes 3D space into three orthogonal 2D planes. This approach alleviates the data sparsity inherent in real outdoor scenes and enables practical downstream tasks via triplane manipulation, including scene inpainting, scene outpainting, and semantic scene completion refinement. The method combines a triplane autoencoder with a DDPM-based triplane diffusion model whose generated triplanes are decoded to semantic labels by an implicit MLP, achieving improved fidelity and diversity on outdoor data, both real (SemanticKITTI) and synthetic (CarlaSC). The results demonstrate meaningful scene generation, flexible editing at object and scene scales, and potential for RGB rendering via ControlNet, with code available for replication and further research.
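To make the triplane factorization concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a dense feature volume of shape `(C, X, Y, Z)` has already been produced by some encoder, and it assumes axis-wise mean pooling as the way to collapse the volume onto each plane; the summary above does not specify the pooling operator. The function name `factorize_to_triplane` and all shapes are hypothetical.

```python
import numpy as np

def factorize_to_triplane(volume):
    """Collapse a (C, X, Y, Z) feature volume into three axis-aligned planes.

    Assumption: mean pooling along the axis orthogonal to each plane.
    """
    h_xy = volume.mean(axis=3)  # pool over Z -> (C, X, Y)
    h_xz = volume.mean(axis=2)  # pool over Y -> (C, X, Z)
    h_yz = volume.mean(axis=1)  # pool over X -> (C, Y, Z)
    return h_xy, h_xz, h_yz

# Hypothetical feature volume: 16 channels on a 32 x 32 x 8 grid.
rng = np.random.default_rng(0)
features = rng.standard_normal((16, 32, 32, 8)).astype(np.float32)

h_xy, h_xz, h_yz = factorize_to_triplane(features)

# The three planes together hold far fewer values than the dense volume.
dense_size = features.size                          # 16 * 32 * 32 * 8
triplane_size = h_xy.size + h_xz.size + h_yz.size   # 16 * (32*32 + 32*8 + 32*8)
print(dense_size, triplane_size)
```

Because every plane cell aggregates a whole column of voxels, an empty voxel does not produce an empty entry in the representation, which is one plausible reading of why the factorized form is less affected by sparse observations than the raw voxel grid.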

Abstract

We present "SemCity," a 3D diffusion model for semantic scene generation in real-world outdoor environments. Most 3D diffusion models focus on generating a single object, synthetic indoor scenes, or synthetic outdoor scenes, while the generation of real-world outdoor scenes is rarely addressed. In this paper, we concentrate on generating a real-outdoor scene through learning a diffusion model on a real-world outdoor dataset. In contrast to synthetic data, real-outdoor datasets often contain more empty spaces due to sensor limitations, causing challenges in learning real-outdoor distributions. To address this issue, we exploit a triplane representation as a proxy form of scene distributions to be learned by our diffusion model. Furthermore, we propose a triplane manipulation that integrates seamlessly with our triplane diffusion model. The manipulation improves our diffusion model's applicability in a variety of downstream tasks related to outdoor scene generation such as scene inpainting, scene outpainting, and semantic scene completion refinements. In experimental results, we demonstrate that our triplane diffusion model shows meaningful generation results compared with existing work in a real-outdoor dataset, SemanticKITTI. We also show our triplane manipulation facilitates seamlessly adding, removing, or modifying objects within a scene. Further, it also enables the expansion of scenes toward a city-level scale. Finally, we evaluate our method on semantic scene completion refinements where our diffusion model enhances predictions of semantic scene completion networks by learning scene distribution. Our code is available at https://github.com/zoomin-lee/SemCity.
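As a rough illustration of what learning a diffusion model on triplanes involves, the sketch below applies the standard DDPM forward (noising) process to a stand-in triplane tensor. It is a generic DDPM sketch under stated assumptions, not the paper's training code: the linear beta schedule, the number of steps, the tensor shape, and the names `make_alpha_bars` and `q_sample` are all hypothetical choices made for this example.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(h0, t, alpha_bars, rng):
    """Draw h_t ~ q(h_t | h_0) = N(sqrt(abar_t) * h_0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(h0.shape)
    h_t = np.sqrt(alpha_bars[t]) * h0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return h_t, eps

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()

# Stand-in for one clean triplane feature map (hypothetical shape).
h0 = rng.standard_normal((16, 32, 32))

h_early, _ = q_sample(h0, 0, alpha_bars, rng)    # almost identical to h0
h_late, _ = q_sample(h0, 999, alpha_bars, rng)   # close to pure Gaussian noise
```

A denoising network is then trained to undo this corruption, so that sampling can start from pure noise and end at a novel triplane, which is subsequently decoded into a semantic scene.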

Paper Structure (46 sections, 10 equations, 16 figures, 3 tables)

This paper contains 46 sections, 10 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: We introduce a diffusion framework, SemCity, designed for generating semantic scenes in real-world outdoor environments as shown in $\text{(a)}$. We extend our diffusion model to various practical tasks: semantic scene completion refinement, scene outpainting, and scene inpainting. For instance, the comprehensive scenario is displayed in $\text{(b)} \! \rightarrow \! \text{(c)} \! \rightarrow \! \text{(d)}$: the refined scene (SSC refinement) $\text{(b)}$ is outpainted to a broader scene $\text{(c)}$; then, an object (in this case, a car) is seamlessly integrated into the scene via our inpainting process $\text{(d)}$.
  • Figure 2: Overview of our method. (a) A 3D semantic map $\mathbf{x}$ is encoded by a triplane encoder $f_\theta$ and factorized to a triplane $\mathbf{h}$. The triplane coupled with a positional encoding $\texttt{PE}(\mathbf{p})$ is decoded by an implicit decoder $g_\theta$, resulting in class probabilities for each coordinate $\mathbf{p}$. (b) Our triplane diffusion model $D_\phi$ learns to generate a novel triplane for semantic scene generation via a denoising diffusion process. (c) We further extend our triplane diffusion beyond simple scene generation toward various practical scenarios by manipulating the triplanes in (b).
  • Figure 3: Scene generation results using both real and synthetic outdoor datasets: SemanticKITTI and CarlaSC. Our results showcase the effective generation of overall structures, including roads and buildings, along with detailed objects such as cars.
  • Figure 4: Higher-resolution scene generation. Because the implicit decoder can be queried at arbitrary coordinates, scenes can be generated at a higher resolution ($1024 \times 1024 \times 128$) than that of the training dataset ($256 \times 256 \times 32$).
  • Figure 5: Scene inpainting of our method. The red boxes denote inpainting regions. (a) and (b) show our inpainting examples from reference images.
  • ...and 11 more figures
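The captions of Figures 2 and 4 describe decoding a triplane at arbitrary coordinates $\mathbf{p}$, which is what makes higher-resolution output possible. The following is a minimal NumPy sketch of that idea, not the authors' decoder. Several details are assumptions made for illustration: bilinear sampling of each plane, summing the three sampled features, a sinusoidal positional encoding, and a two-layer MLP with random weights. All function names, shapes, and the class count of 20 are hypothetical.

```python
import numpy as np

def sample_plane(plane, u, v):
    """Bilinearly sample a (C, H, W) plane at normalized coords u, v in [0, 1]."""
    _, H, W = plane.shape
    x = np.clip(u, 0.0, 1.0) * (H - 1)
    y = np.clip(v, 0.0, 1.0) * (W - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, H - 1), np.minimum(y0 + 1, W - 1)
    wx, wy = x - x0, y - y0
    top = plane[:, x0, y0] * (1 - wy) + plane[:, x0, y1] * wy
    bot = plane[:, x1, y0] * (1 - wy) + plane[:, x1, y1] * wy
    return (top * (1 - wx) + bot * wx).T  # (N, C)

def positional_encoding(p, num_freqs=4):
    """Sinusoidal encoding of (N, 3) coordinates -> (N, 3 * 2 * num_freqs)."""
    freqs = 2.0 ** np.arange(num_freqs) * np.pi
    angles = p[:, :, None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).reshape(len(p), -1)

def decode(triplane, p, weights):
    """Toy implicit decoder: class probabilities for each coordinate in p."""
    h_xy, h_xz, h_yz = triplane
    feat = (sample_plane(h_xy, p[:, 0], p[:, 1])
            + sample_plane(h_xz, p[:, 0], p[:, 2])
            + sample_plane(h_yz, p[:, 1], p[:, 2]))  # assumption: sum aggregation
    z = np.concatenate([feat, positional_encoding(p)], axis=1)
    w1, w2 = weights
    logits = np.maximum(z @ w1, 0.0) @ w2            # ReLU MLP
    logits -= logits.max(axis=1, keepdims=True)      # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def query_grid(shape):
    """Voxel-center coordinates in [0, 1]^3 for a grid of the given shape."""
    axes = [(np.arange(n) + 0.5) / n for n in shape]
    return np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)

rng = np.random.default_rng(0)
C, num_classes = 8, 20
triplane = (rng.standard_normal((C, 16, 16)),   # xy plane
            rng.standard_normal((C, 16, 4)),    # xz plane
            rng.standard_normal((C, 16, 4)))    # yz plane
weights = (rng.standard_normal((C + 24, 32)), rng.standard_normal((32, num_classes)))

# Same triplane, two output resolutions: native and 4x per axis.
probs_lo = decode(triplane, query_grid((16, 16, 4)), weights)
labels_lo = probs_lo.argmax(axis=1).reshape(16, 16, 4)
labels_hi = decode(triplane, query_grid((64, 64, 16)), weights).argmax(axis=1).reshape(64, 64, 16)
```

Only the query grid changes between the two calls; the triplane and decoder weights stay fixed. The 4x-per-axis factor mirrors the ratio between the $256 \times 256 \times 32$ training resolution and the $1024 \times 1024 \times 128$ output reported in Figure 4.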