JADE: Joint-aware Latent Diffusion for 3D Human Generative Modeling
Haorui Ji, Rong Wang, Taojun Lin, Hongdong Li
TL;DR
JADE tackles the challenge of expressive yet controllable 3D human body generation by introducing a joint-aware latent diffusion framework that factorizes geometry into extrinsic joint positions and intrinsic local surface features. A Transformer-based autoencoder maps surface point clouds to joint tokens, and a cascaded diffusion pipeline models $p(\mathbf{E})$ and $p(\mathbf{H} \mid \mathbf{E})$ to enable coherent generation and flexible editing. Across DFAUST, SPRING, and AMASS, JADE achieves high reconstruction accuracy, interpretable editing, and competitive generation quality, outperforming several parametric and learning-based baselines. The approach provides a principled path toward structure-aware, controllable 3D human generation, with potential extensions to texture synthesis and neural implicit representations.
Abstract
Generative modeling of 3D human bodies has been studied extensively in computer vision. The core challenge is to design a compact latent representation that is both expressive and semantically interpretable, yet existing approaches struggle to satisfy both requirements. In this work, we introduce JADE, a generative framework that learns the variations of human shapes with fine-grained control. Our key insight is a joint-aware latent representation that decomposes human bodies into skeleton structures, modeled by joint positions, and local surface geometries, characterized by features attached to each joint. This disentangled latent space design enables geometric and semantic interpretation, giving users flexible control. To generate coherent and plausible human shapes under our proposed decomposition, we also present a cascaded pipeline in which two diffusion models capture the distributions of skeleton structures and local surface geometries, respectively. Extensive experiments on public datasets demonstrate the effectiveness of the JADE framework across multiple tasks, in terms of autoencoding reconstruction accuracy, editing controllability, and generation quality, compared with existing methods.
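The cascaded design can be read as a factorization of the joint latent distribution. The following is a sketch in the TL;DR's notation, where $\mathbf{E}$ denotes the joint positions and $\mathbf{H}$ the local surface features attached to the joints. The decoder symbol $\mathcal{D}$ and the reconstructed point cloud $\hat{\mathbf{X}}$ are assumed here for illustration and do not come from the source.

$$
p(\mathbf{E}, \mathbf{H}) = p(\mathbf{E})\, p(\mathbf{H} \mid \mathbf{E}), \qquad \mathbf{E} \sim p(\mathbf{E}), \quad \mathbf{H} \sim p(\mathbf{H} \mid \mathbf{E}), \quad \hat{\mathbf{X}} = \mathcal{D}(\mathbf{E}, \mathbf{H})
$$

Under this reading, the first diffusion model samples a skeleton and the second samples surface features conditioned on that skeleton. Editing could then amount to holding one factor fixed while modifying or resampling the other.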
