SSEditor: Controllable Mask-to-Scene Generation with Diffusion Model

Haowen Zheng, Yanyan Liang

TL;DR

SSEditor is a controllable Semantic Scene Editor that can generate specified target categories without multiple-step resampling and is capable of generating novel urban scenes, enabling the rapid construction of 3D scenes.

Abstract

Recent advancements in 3D diffusion-based semantic scene generation have gained attention. However, existing methods rely on unconditional generation and require multiple resampling steps when editing scenes, which significantly limits their controllability and flexibility. To this end, we propose SSEditor, a controllable Semantic Scene Editor that can generate specified target categories without multiple-step resampling. SSEditor employs a two-stage diffusion-based framework: (1) a 3D scene autoencoder is trained to obtain latent triplane features, and (2) a mask-conditional diffusion model is trained for customizable 3D semantic scene generation. In the second stage, we introduce a geometric-semantic fusion module that enhances the model's ability to learn geometric and semantic information. This ensures that objects are generated with correct positions, sizes, and categories. Extensive experiments on SemanticKITTI and CarlaSC demonstrate that SSEditor outperforms previous approaches in terms of controllability and flexibility in target generation, as well as the quality of semantic scene generation and reconstruction. More importantly, experiments on the unseen Occ-3D Waymo dataset show that SSEditor is capable of generating novel urban scenes, enabling the rapid construction of 3D scenes.
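
To make the term "triplane features" in stage (1) concrete, here is a minimal NumPy sketch of one common way to build and query a triplane. It is an illustration under stated assumptions, not the paper's autoencoder: the function names `voxel_to_triplane` and `query_triplane`, the axis-wise mean pooling, and the summation of the three plane features are all hypothetical choices, and the actual model learns its encoder and decoder.

```python
import numpy as np

def voxel_to_triplane(features):
    """Collapse a (C, X, Y, Z) feature volume into three axis-aligned planes.

    Assumption: mean pooling along the dropped axis; a learned encoder
    would replace this step in a real model.
    """
    xy = features.mean(axis=3)  # (C, X, Y), pooled over Z
    xz = features.mean(axis=2)  # (C, X, Z), pooled over Y
    yz = features.mean(axis=1)  # (C, Y, Z), pooled over X
    return xy, xz, yz

def query_triplane(planes, x, y, z):
    """Look up the feature of voxel (x, y, z) by summing its three projections."""
    xy, xz, yz = planes
    return xy[:, x, y] + xz[:, x, z] + yz[:, y, z]

# Hypothetical toy volume: 4 channels on a 16 x 16 x 8 grid.
rng = np.random.default_rng(0)
volume = rng.standard_normal((4, 16, 16, 8))
planes = voxel_to_triplane(volume)
feat = query_triplane(planes, 3, 5, 2)
```

The appeal of this layout is that three 2D planes are far smaller than the full 3D volume, which is why a diffusion model can operate on them more cheaply than on raw voxels.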

Paper Structure

This paper contains 16 sections, 7 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Controllable 3D semantic scene generation by SSEditor. The proposed SSEditor enables users to customize the generation or editing of 3D scenes using pre-built mask assets: (a) create a background scene and generate objects on it; (b) eliminate trailing artifacts of dynamic objects in SemanticKITTI (Behley et al., 2019); (c) modify roads, such as expanding a two-lane road to a four-lane road; (d) concatenate masks from various scenes to produce a larger-scale 3D scene.
  • Figure 2: Illustration of our SSEditor framework. It comprises two main processes: (a) a 3D autoencoder learns the triplane representation via scene reconstruction, and (b) controllable semantic scene generation is achieved through masks, semantic labels, and tokens. The Geometric-Semantic Fusion Module is essential for the diffusion model to effectively learn both geometric and semantic information.
  • Figure 3: Pipeline of building 3D mask assets. The 3D mask is stored in the corresponding category in the form of a trimask.
  • Figure 4: The details of editing 3D scenes with SSEditor: 1. When the mask of an object is set to 0, the corresponding object can be completely removed. 2. The background can be edited, such as widening roads to simulate heavier traffic. 3. Objects can be added to the edited scene.
  • Figure 5: Visualization of semantic scene generation compared with SemCity (Lee et al., 2024) on SemanticKITTI (Behley et al., 2019) and CarlaSC (Wilson et al., 2022). With the trimask as the guiding condition, SSEditor demonstrates strong controllability.
  • ...and 1 more figure
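
The three editing operations listed for Figure 4, together with the trimask storage mentioned for Figure 3, can be sketched on a toy voxel grid. This is a hedged illustration only: the helper `to_trimask`, the max-projection onto the three planes, the grid size, and the per-category dictionary layout are assumptions made for the example, not the paper's actual asset pipeline.

```python
import numpy as np

def to_trimask(mask3d):
    """Project a binary (X, Y, Z) mask onto the xy, xz and yz planes.

    Assumption: a plane cell is 1 if any voxel along the dropped axis is 1.
    """
    return {
        "xy": mask3d.max(axis=2),
        "xz": mask3d.max(axis=1),
        "yz": mask3d.max(axis=0),
    }

# Hypothetical toy scene: one binary occupancy mask per category.
shape = (8, 8, 4)
masks = {name: np.zeros(shape, dtype=np.uint8) for name in ("road", "car")}
masks["road"][:, 3:5, 0] = 1     # a road two voxels wide, running along x
masks["car"][2:4, 3:4, 1:2] = 1  # a car sitting on the road

# 1. Remove an object by setting its mask to 0.
masks["car"][:] = 0

# 2. Edit the background: widen the road from two to four voxels.
masks["road"][:, 2:6, 0] = 1

# 3. Add an object to the edited scene.
masks["car"][5:7, 4:5, 1:2] = 1

# Store each edited category mask as a trimask for conditioning.
trimasks = {name: to_trimask(m) for name, m in masks.items()}
```

In this sketch the edits touch only the condition masks; the scene itself would then be produced by the mask-conditional diffusion model, which is what lets editing avoid multi-step resampling.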