JADE: Joint-aware Latent Diffusion for 3D Human Generative Modeling

Haorui Ji, Rong Wang, Taojun Lin, Hongdong Li

TL;DR

JADE tackles the challenge of expressive yet controllable 3D human body generation by introducing a joint-aware latent diffusion framework that factorizes geometry into extrinsic joint positions and intrinsic local surface features. A Transformer-based autoencoder maps surface point clouds to joint tokens, and a cascaded diffusion pipeline models $p(\mathbf{E})$ and $p(\mathbf{H}|\mathbf{E})$ to enable coherent generation and flexible editing. Across DFAUST, SPRING, and AMASS, JADE achieves high reconstruction accuracy, interpretable editing, and competitive generation quality, outperforming several parametric and learning-based baselines. The approach provides a principled path toward structure-aware, controllable 3D human generation, with potential extensions to texture synthesis and neural implicit representations.

Abstract

Generative modeling of 3D human bodies has been studied extensively in computer vision. The core challenge is to design a compact latent representation that is both expressive and semantically interpretable, yet existing approaches struggle to meet both requirements. In this work, we introduce JADE, a generative framework that learns the variations of human shapes with fine-grained control. Our key insight is a joint-aware latent representation that decomposes human bodies into skeleton structures, modeled by joint positions, and local surface geometries, characterized by features attached to each joint. This disentangled latent space design enables geometric and semantic interpretation and gives users flexible control. To generate coherent and plausible human shapes under the proposed decomposition, we also present a cascaded pipeline in which two diffusion models capture the distributions of skeleton structures and local surface geometries, respectively. Extensive experiments on public datasets demonstrate the effectiveness of the JADE framework across multiple tasks, in terms of autoencoding reconstruction accuracy, editing controllability, and generation quality, compared with existing methods.
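
The decomposition and the cascade described above can be illustrated with a minimal sketch. Everything named here is hypothetical: `JointToken`, `sample_extrinsics`, `sample_intrinsics`, the joint count of 24, and the feature width of 8 are assumptions made for illustration, and the two samplers are Gaussian stand-ins for the trained diffusion models rather than the authors' networks. The only point is the order of sampling: the skeleton is drawn first, then the local geometry is drawn conditioned on it, following $p(\mathbf{E}, \mathbf{H}) = p(\mathbf{E})\,p(\mathbf{H}|\mathbf{E})$.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Position = Tuple[float, float, float]


@dataclass
class JointToken:
    """One token per joint (hypothetical container, not from the paper)."""
    position: Position      # extrinsic e_i: joint location
    feature: List[float]    # intrinsic h_i: local surface geometry code


def sample_extrinsics(num_joints: int, rng: random.Random) -> List[Position]:
    """Stand-in for the first diffusion model, which would draw E ~ p(E)."""
    return [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
            for _ in range(num_joints)]


def sample_intrinsics(extrinsics: List[Position], feature_dim: int,
                      rng: random.Random) -> List[List[float]]:
    """Stand-in for the second diffusion model, which would draw H ~ p(H | E).

    Conditioning is faked by centring each feature on the joint's
    y-coordinate; a trained model would condition through its network.
    """
    return [[rng.gauss(e[1], 0.1) for _ in range(feature_dim)]
            for e in extrinsics]


def sample_body(num_joints: int = 24, feature_dim: int = 8,
                seed: int = 0) -> List[JointToken]:
    """Cascaded sampling: skeleton first, then geometry given the skeleton."""
    rng = random.Random(seed)
    E = sample_extrinsics(num_joints, rng)          # phase 1
    H = sample_intrinsics(E, feature_dim, rng)      # phase 2, conditioned on E
    return [JointToken(e, h) for e, h in zip(E, H)]
```

Because the second stage only reads the output of the first, the skeleton can in principle be fixed or edited by hand and the intrinsic features resampled around it, which is the kind of control the cascade is meant to allow.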

Paper Structure

This paper contains 20 sections, 9 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Overview of our joint-aware latent representation. We decompose the modeling of the whole human body into a sequence of joint tokens, where each token contains extrinsic parameters that encode the skeleton structure and intrinsic features that model the local surface geometry.
  • Figure 2: The autoencoder architecture used to learn our joint-aware latent representation, together with its training pipeline. (a) The encoder $f_{enc}(\cdot)$, which consists of a tokenization network and a mixing network, maps a human mesh $x$ to the extrinsic parameters $E$ and intrinsic features $H$; the decoder $f_{dec}(\cdot)$ aims to recover the original shape from the paired latents. (b) In addition to a standard reconstruction loss, we employ a disentanglement loss to ensure that the skeleton structure and the geometric details are preserved independently, and a prior loss to encourage a smooth latent space.
  • Figure 3: The diffusion pipeline, which cascades two diffusion models: one for the extrinsic parameters $\mathbf{E} = \{\mathbf{e}_i\}_{i=1}^J$ and one for the intrinsic features $\mathbf{H} = \{\mathbf{h}_i\}_{i=1}^J$. In the first phase, a time-conditioned spatial transformer handles diffusion on set data. In the second phase, a DiT handles the more complex conditioning: the concatenation of the encoded timestep $\gamma(t)$ and the phase-one extrinsic outputs $\phi(\mathbf{E})$ is fed to the adaptive normalization layers to modulate the diffusion process.
  • Figure 4: Qualitative results on the DFAUST dataset, with reconstructions color-coded by MPVPE for our JADE framework and the baseline methods [pavlakos2019expressive, loper2023smpl, zhou2020unsupervised, sun2023learning]. The error maps show that our method achieves better reconstruction accuracy.
  • Figure 5: Qualitative example of transferring a given human character to a new identity with a different body posture through interpolation, indicating that JADE can edit the extrinsic and intrinsic components of the latent space independently to achieve the desired human body movements.
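
The independent editing described in the Figure 5 caption can be sketched as interpolation with separate weights for the two latent components. This is a hedged illustration only: the caption says "interpolation" without specifying the scheme, so linear interpolation is an assumption here, and `lerp`, `blend_tokens`, `w_extrinsic`, and `w_intrinsic` are hypothetical names. Decoding the blended latents back to a mesh would require the trained decoder and is not shown.

```python
from typing import List, Sequence, Tuple

# A body is represented as (E, H): per-joint extrinsics and per-joint intrinsics.
Latents = Tuple[List[List[float]], List[List[float]]]


def lerp(a: Sequence[float], b: Sequence[float], w: float) -> List[float]:
    """Linear interpolation between two vectors; w=0 returns a, w=1 returns b."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def blend_tokens(src: Latents, dst: Latents,
                 w_extrinsic: float, w_intrinsic: float) -> Latents:
    """Interpolate extrinsics and intrinsics with independent weights.

    Assumes both bodies use the same joint ordering, so that joint i of
    `src` corresponds to joint i of `dst`.
    """
    E_src, H_src = src
    E_dst, H_dst = dst
    E = [lerp(a, b, w_extrinsic) for a, b in zip(E_src, E_dst)]
    H = [lerp(a, b, w_intrinsic) for a, b in zip(H_src, H_dst)]
    return E, H
```

Setting `w_extrinsic=1.0` and `w_intrinsic=0.0`, for example, takes the target's joint positions while keeping the source's local surface features; intermediate weights move the two components along separate paths.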