StructLDM: Structured Latent Diffusion for 3D Human Generation

Tao Hu; Fangzhou Hong; Ziwei Liu

TL;DR

StructLDM addresses the limitations of 1D latent spaces in 3D human generation with a semantically structured latent space defined on the SMPL UV surface. It couples this latent space with a structured auto-decoder built from local NeRFs and a diffusion model that uses structure-aware normalization. The resulting two-stage approach enables high-fidelity, view-consistent 3D human generation and rich local editing, including pose/view/shape control and compositional clothing edits, without relying on latent mappings from 1D spaces. On UBCFashion, RenderPeople, and THUman2.0, the method achieves state-of-the-art FID scores and favorable user-study results, and it supports capabilities such as 3D virtual try-on and part-aware diffusion. By preserving body-topology semantics in the latent space and leveraging diffusion priors, StructLDM provides a controllable framework for realistic 3D human synthesis from 2D data, with potential applications in fashion, telepresence, and digital humans.

Abstract

Recent 3D human generative models have achieved remarkable progress by learning 3D-aware GANs from 2D images. However, existing 3D human generative methods model humans in a compact 1D latent space, ignoring the articulated structure and semantics of human body topology. In this paper, we explore a more expressive, higher-dimensional latent space for 3D human modeling and propose StructLDM, a diffusion-based unconditional 3D human generative model that is learned from 2D images. StructLDM addresses the challenges posed by the higher dimensionality of the latent space with three key designs: 1) A semantic structured latent space defined on the dense surface manifold of a statistical human body template. 2) A structured 3D-aware auto-decoder that factorizes the global latent space into several semantic body parts, parameterized by a set of conditional structured local NeRFs anchored to the body template; the auto-decoder embeds the properties learned from the 2D training data and can be decoded to render view-consistent humans under different poses and clothing styles. 3) A structured latent diffusion model for generative human appearance sampling. Extensive experiments validate StructLDM's state-of-the-art generation performance and illustrate the expressiveness of the structured latent space compared with the widely adopted 1D latent space. Notably, StructLDM enables different levels of controllable 3D human generation and editing, including pose/view/shape control, and high-level tasks such as compositional generation, part-aware clothing editing, and 3D virtual try-on. Our project page is at: https://taohuumd.github.io/projects/StructLDM/.
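
As a rough illustration of why a structured latent lends itself to part-aware editing, the sketch below treats a latent as a small 2D grid aligned with a body template's UV layout and composes a new latent by taking each region from a different source subject. The grid size, the part labels, and the `constant_latent` and `blend_parts` helpers are hypothetical and chosen only for illustration; the paper's actual latent resolution, part partition, and decoder are not reproduced here.

```python
"""Toy part-wise composition of structured latents (illustrative only)."""
from typing import Dict, List

Latent = List[List[float]]  # H x W grid, one scalar per UV cell (real latents carry channels)
PartMap = List[List[str]]   # H x W grid of part labels, shared by all subjects

# Hypothetical 4x4 UV layout. Every subject shares this part map, so
# cell (i, j) refers to the same body region in every latent.
PART_MAP: PartMap = [
    ["head", "head", "head", "head"],
    ["upper", "upper", "upper", "upper"],
    ["upper", "upper", "upper", "upper"],
    ["lower", "lower", "lower", "lower"],
]


def constant_latent(value: float, part_map: PartMap) -> Latent:
    """Build a stand-in latent filled with one value (a real latent is learned)."""
    return [[value for _ in row] for row in part_map]


def blend_parts(sources: Dict[str, Latent], part_map: PartMap) -> Latent:
    """Copy each cell from the source latent assigned to that cell's part."""
    return [
        [sources[part][i][j] for j, part in enumerate(row)]
        for i, row in enumerate(part_map)
    ]


subject_a = constant_latent(1.0, PART_MAP)
subject_b = constant_latent(2.0, PART_MAP)
subject_c = constant_latent(3.0, PART_MAP)

# Head from A, upper body from B, lower body from C.
composed = blend_parts(
    {"head": subject_a, "upper": subject_b, "lower": subject_c}, PART_MAP
)
```

Because all latents share one template-aligned layout, the composition is a per-cell copy. A 1D latent vector has no such spatial correspondence, so a comparable edit would need some mapping from vector entries to body regions.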

Paper Structure (24 sections, 8 equations, 20 figures, 8 tables)

Introduction
Related Work
Our Approach
Structured 3D Human Representation
Structured Auto-decoder
Joint Learning of Auto-decoder
Structured Latent Diffusion Model
Experiments
Experimental Setup
Comparisons to SOTA Methods
Ablation Study
Controllable Human Generation and Editing
Discussion
Implementation
Network Architecture
...and 9 more sections

Figures (20)

Figure 1: StructLDM generates diverse view-consistent humans and supports different levels of controllable generation and editing, such as compositional generation by blending the five selected parts from (a), and part-aware editing such as identity swapping, local clothing editing, and 3D virtual try-on. Note that generation and editing are clothing-agnostic and require no clothing types or masks.
Figure 2: Two-stage framework. In Stage 1, given a training dataset containing images of various human subjects with estimated SMPL and camera parameter distribution $p_{est}$, an auto-decoder is learned to optimize the structured latent $z \in \mathcal{Z}$ for each training subject. Each latent is rendered into a pose- and view-dependent image by a structured volumetric renderer $G_1$ and a global style mixer module (GM) $G_2$. In Stage 2, the auto-decoder parameters are frozen and the learned structured latents $\mathcal{Z}$ are used to train the latent diffusion model. At inference time, latents are randomly sampled and decoded by $G_2 \circ G_1$ for human rendering.
Figure 3: Qualitative results on UBCFashion. We generate diverse view-consistent humans under different poses/views for different clothing styles (e.g. dress) and hairstyles.
Figure 4: Qualitative comparisons on RenderPeople [renderpeople]. The geometry is visualized as normal/depth maps at $128\times64$ resolution, and images are cropped to $512\times256$ for visualization. Our results (③④⑤) show higher-quality faces than PrimDiff [chen2023primdiffusion] (①②).
Figure 5: Comparisons on THUman2.0 [thuman2]. The geometry is visualized as normal/depth maps at $128\times64$ resolution.
...and 15 more figures
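
The Stage 2 step described in the Figure 2 caption trains a diffusion model on latents that are already learned and frozen. A minimal sketch of the forward noising that such training typically relies on is given below, assuming the standard DDPM formulation $z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ with a linear noise schedule. The schedule values, the number of steps, the flattened four-element latent, and the helper names are all assumptions for illustration; the paper's actual schedule, latent shape, and denoiser are not shown.

```python
"""Toy Stage-2 input: noising a frozen latent with a standard DDPM schedule."""
import math
import random
from typing import List, Tuple


def linear_betas(num_steps: int, start: float = 1e-4, end: float = 0.02) -> List[float]:
    """Linear noise schedule (assumed here; the paper's schedule may differ)."""
    step = (end - start) / (num_steps - 1)
    return [start + step * t for t in range(num_steps)]


def cumulative_alphas(betas: List[float]) -> List[float]:
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def q_sample(
    z0: List[float], t: int, alpha_bars: List[float], rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Draw z_t ~ N(sqrt(alpha_bar_t) * z0, (1 - alpha_bar_t) * I) and return the noise used."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [signal * z + sigma * e for z, e in zip(z0, noise)]
    return z_t, noise


betas = linear_betas(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
z0 = [0.5, -1.0, 0.25, 2.0]  # stand-in for one flattened, frozen Stage-1 latent
z_t, eps = q_sample(z0, 500, alpha_bars, rng)
```

In a typical setup, a denoising network would be trained to predict `eps` from `z_t` and the step index, and sampling would run the learned reverse process before decoding the result with the frozen auto-decoder.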
