Frankenstein: Generating Semantic-Compositional 3D Scenes in One Tri-Plane

Han Yan, Yang Li, Zhennan Wu, Shenzhou Chen, Weixuan Sun, Taizhang Shang, Weizhe Liu, Tian Chen, Xiaqiang Dai, Chao Ma, Hongdong Li, Pan Ji

TL;DR

Frankenstein tackles the generation of semantically decomposed, multi-part 3D scenes with diffusion. It introduces a tri-plane diffusion framework that decodes multiple class-specific SDFs from a single tri-plane, so that all semantic parts are generated simultaneously, each as a complete shape. The method uses a three-stage pipeline (tri-plane fitting, a VAE that compresses tri-planes into a latent space, and a diffusion model conditioned on layout maps) and is applied to room interiors and compositional avatars. The results show plausible, separable geometry and support practical editing such as part-wise texturing and cloth re-targeting, offering an approach to semantically structured 3D content creation.

Abstract

We present Frankenstein, a diffusion-based framework that can generate semantic-compositional 3D scenes in a single pass. Unlike existing methods that output a single, unified 3D shape, Frankenstein simultaneously generates multiple separated shapes, each corresponding to a semantically meaningful part. The 3D scene information is encoded in a single tri-plane tensor, from which multiple Signed Distance Function (SDF) fields can be decoded to represent the compositional shapes. During training, an auto-encoder compresses tri-planes into a latent space, and a denoising diffusion process is then employed to approximate the distribution of the compositional scenes. Frankenstein demonstrates promising results in generating room interiors as well as human avatars with automatically separated parts. The generated scenes facilitate many downstream applications, such as part-wise re-texturing, object rearrangement in the room, or avatar cloth re-targeting. Our project page is available at: https://wolfball.github.io/frankenstein/.
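
The idea of decoding several SDF fields from one tri-plane can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`sample_plane`, `decode_sdfs`), the tensor shapes, the choice to sum the three plane features, the two-layer MLP, and the random weights are all assumptions made for illustration. Each 3D point is projected onto the three axis-aligned planes, features are bilinearly sampled, and a small MLP outputs one signed distance per semantic class.

```python
import numpy as np

def sample_plane(plane, u, v):
    """Bilinearly sample a (C, R, R) feature plane at coordinates u, v in [-1, 1]."""
    _, R, _ = plane.shape
    # Map [-1, 1] to continuous pixel coordinates in [0, R - 1].
    x = (u + 1.0) * 0.5 * (R - 1)
    y = (v + 1.0) * 0.5 * (R - 1)
    # Index of the top-left corner of the enclosing cell, kept inside the grid.
    x0 = np.clip(np.floor(x).astype(int), 0, R - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, R - 2)
    wx, wy = x - x0, y - y0
    top = plane[:, y0, x0] * (1 - wx) + plane[:, y0, x0 + 1] * wx
    bot = plane[:, y0 + 1, x0] * (1 - wx) + plane[:, y0 + 1, x0 + 1] * wx
    return (top * (1 - wy) + bot * wy).T  # shape (N, C)

def decode_sdfs(triplane, points, w1, b1, w2, b2):
    """Decode K signed distances per point from a (3, C, R, R) tri-plane.

    points has shape (N, 3) with coordinates in [-1, 1]; the result has
    shape (N, K), one column per semantic class.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Assumption: features from the XY, XZ and YZ planes are summed.
    feat = (sample_plane(triplane[0], x, y)
            + sample_plane(triplane[1], x, z)
            + sample_plane(triplane[2], y, z))
    hidden = np.maximum(feat @ w1 + b1, 0.0)  # ReLU hidden layer
    return hidden @ w2 + b2

# Hypothetical sizes: C feature channels, R plane resolution, K classes.
C, R, K, H = 8, 16, 3, 32
rng = np.random.default_rng(0)
triplane = rng.normal(size=(3, C, R, R))
w1, b1 = rng.normal(size=(C, H)), np.zeros(H)
w2, b2 = rng.normal(size=(H, K)), np.zeros(K)

points = rng.uniform(-1.0, 1.0, size=(5, 3))
sdf = decode_sdfs(triplane, points, w1, b1, w2, b2)
inside = sdf < 0.0  # inside[n, k]: point n lies inside the shape of class k
```

Because every class has its own output column, each part can be extracted as a separate surface (the zero level set of its column) while all parts share the same tri-plane features.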

Paper Structure (13 sections, 11 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: Training pipeline of Frankenstein. Tri-plane fitting: training scenes are converted into tri-planes. VAE training: tri-planes are compressed into latent tri-planes via an auto-encoder. Conditional denoising: the distribution of latent tri-planes is approximated by a diffusion model conditioned on layout maps. At inference time, given a 2D layout, the diffusion model denoises random noise to produce a latent tri-plane. This latent tri-plane is then upsampled to a higher resolution by the VAE. Finally, a lightweight MLP decodes the high-resolution tri-plane into multiple semantic-wise SDFs.
  • Figure 2: Two approaches to incorporate semantic information into neural fields.
  • Figure 3: Interpolation between two rooms on tri-plane space and latent tri-plane space.
  • Figure 4: Qualitative room generation results. The prompt for Text2Room [hollein2023text2room] is "a wooden style bedroom with a king-size bed and large wardrobes." The textures for CommonScenes [zhai2023commonscenes] and our results are generated using SyncMVD [liu2023text] with the prompt "wooden". The geometry of CC3D [bahmani2023cc3d] is reconstructed from a point cloud extracted from depth images using ball-pivoting.
  • Figure 5: Applications of semantic-compositional room generation. Each component (wall/bed/cabinet) is textured using Text2Tex [chen2023text2tex] with wooden, graffiti, shabby and Chinese styles. Components can also be rearranged.
  • ...and 10 more figures
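
The interpolation shown in Figure 3 can be pictured with a short sketch. This assumes a plain linear blend between two tensors of equal shape; the blending scheme actually used for the figure is not stated here, and the helper name `interpolate` and the latent shape `(3, 4, 8, 8)` are hypothetical. The same function applies to full-resolution tri-planes or to latent tri-planes, since both are just arrays.

```python
import numpy as np

def interpolate(a, b, steps):
    """Return `steps` linear blends from a to b, endpoints included."""
    ts = np.linspace(0.0, 1.0, steps)
    return [(1.0 - t) * a + t * b for t in ts]

# Stand-ins for the latent tri-planes of two rooms (random, for illustration).
rng = np.random.default_rng(0)
latent_a = rng.normal(size=(3, 4, 8, 8))
latent_b = rng.normal(size=(3, 4, 8, 8))

blends = interpolate(latent_a, latent_b, steps=5)
```

In a full pipeline, each blended latent would be passed through the VAE decoder and the SDF decoder to obtain an intermediate scene; those components are omitted here.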