StructLDM: Structured Latent Diffusion for 3D Human Generation
Tao Hu, Fangzhou Hong, Ziwei Liu
TL;DR
StructLDM addresses the limitations of 1D latent spaces in 3D human generation. It introduces a semantically structured latent space defined on the SMPL UV surface and couples it with a structured auto-decoder of local NeRFs and a diffusion model with structure-aware normalization. This two-stage approach enables high-fidelity, view-consistent 3D human generation and rich local editing, including pose/view/shape control and compositional clothing edits, without relying on latent mappings from 1D spaces. Across UBCFashion, RenderPeople, and THUman2.0, the method achieves state-of-the-art FID scores and favorable user-study results, and it enables new capabilities such as 3D virtual try-on and part-aware diffusion. By preserving body topology semantics in the latent space and leveraging diffusion priors, StructLDM provides a scalable, controllable framework for realistic 3D human synthesis from 2D data, with potential applications in fashion, telepresence, and digital humans.
Abstract
Recent 3D human generative models have achieved remarkable progress by learning 3D-aware GANs from 2D images. However, existing 3D human generative methods model humans in a compact 1D latent space, ignoring the articulated structure and semantics of human body topology. In this paper, we explore a more expressive, higher-dimensional latent space for 3D human modeling and propose StructLDM, a diffusion-based unconditional 3D human generative model that is learned from 2D images. StructLDM addresses the challenges posed by the higher dimensionality of the latent space with three key designs: 1) A semantically structured latent space defined on the dense surface manifold of a statistical human body template. 2) A structured 3D-aware auto-decoder that factorizes the global latent space into several semantic body parts, parameterized by a set of conditional structured local NeRFs anchored to the body template. The auto-decoder embeds the properties learned from the 2D training data, and its latents can be decoded to render view-consistent humans under different poses and clothing styles. 3) A structured latent diffusion model for generative human appearance sampling. Extensive experiments validate StructLDM's state-of-the-art generation performance and illustrate the expressiveness of the structured latent space over the widely adopted 1D latent space. Notably, StructLDM enables different levels of controllable 3D human generation and editing, including pose/view/shape control, and high-level tasks such as compositional generation, part-aware clothing editing, and 3D virtual try-on. Our project page is at: https://taohuumd.github.io/projects/StructLDM/.
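
To make the idea of a part-structured latent more concrete, the following toy sketch shows how a latent laid out on a 2D UV grid could be edited one body part at a time by copying regions from one subject's latent into another's. It is only an illustration under loud assumptions: the single-channel grid, the horizontal-band part layout, and the names `make_part_masks` and `compose` are all hypothetical and not taken from the paper. A real SMPL UV part segmentation is not a set of rectangular bands, and the decoding and rendering steps are omitted entirely.

```python
from typing import Dict, List, Set, Tuple

Latent = List[List[float]]   # toy single-channel latent on an H x W UV grid (assumption)
Texel = Tuple[int, int]      # (row, col) index into the UV grid


def make_part_masks(height: int, width: int) -> Dict[str, Set[Texel]]:
    """Hypothetical layout: split the UV grid into three horizontal bands."""
    bands = {
        "head": (0, height // 4),
        "upper_body": (height // 4, height // 2),
        "lower_body": (height // 2, height),
    }
    return {
        name: {(r, c) for r in range(r0, r1) for c in range(width)}
        for name, (r0, r1) in bands.items()
    }


def compose(base: Latent, donor: Latent,
            masks: Dict[str, Set[Texel]],
            parts_from_donor: List[str]) -> Latent:
    """Copy the donor's latent values into the base latent for the chosen parts."""
    out = [row[:] for row in base]  # copy rows so the inputs stay untouched
    for part in parts_from_donor:
        for r, c in masks[part]:
            out[r][c] = donor[r][c]
    return out


H, W = 8, 4
masks = make_part_masks(H, W)
person_a = [[0.0] * W for _ in range(H)]  # stand-in for one subject's latent
person_b = [[1.0] * W for _ in range(H)]  # stand-in for another subject's latent

# Take only the upper-body region from person_b; everything else stays person_a.
edited = compose(person_a, person_b, masks, ["upper_body"])
```

Because every texel of the grid belongs to a fixed body region, an edit of this kind stays local by construction. A single 1D latent vector has no such spatial correspondence, which is the contrast the abstract draws.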
