Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation

Zichen Geng, Zeeshan Hayder, Bo Miao, Jian Liu, Wei Liu, Ajmal Mian

TL;DR

This work proposes a Disentangled Hierarchical Variational Autoencoder (DHVAE) with latent diffusion for structured and controllable Human-Human Interaction (HHI) generation, and adds contrastive learning constraints to the DHVAE to reduce implausible and physically inconsistent contacts.

Abstract

Generating realistic 3D Human-Human Interaction (HHI) requires coherent modeling of the physical plausibility of the agents and their interaction semantics. Existing methods compress all motion information into a single latent representation, limiting their ability to capture fine-grained actions and inter-agent interactions. This often leads to semantic misalignment and physically implausible artifacts, such as penetration or missed contact. We propose Disentangled Hierarchical Variational Autoencoder (DHVAE) based latent diffusion for structured and controllable HHI generation. DHVAE explicitly disentangles the global interaction context and individual motion patterns into a decoupled latent structure by employing a CoTransformer module. To mitigate implausible and physically inconsistent contacts in HHI, we incorporate contrastive learning constraints with our DHVAE to promote a more discriminative and physically plausible latent interaction space. For high-fidelity interaction synthesis, DHVAE employs a DDIM-based diffusion denoising process in the hierarchical latent space, enhanced by a skip-connected AdaLN-Transformer denoiser. Extensive evaluations show that DHVAE achieves superior motion fidelity, text alignment, and physical plausibility with greater computational efficiency.
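
The abstract mentions a DDIM-based denoising process in the latent space without giving its update rule. As a rough illustration only, the standard deterministic DDIM step (with $\eta = 0$) can be sketched as below. The function name `ddim_step`, the latent shape, and the noise-schedule values are assumptions made for this example and are not taken from the paper; in practice the predicted noise would come from the denoiser rather than being known.

```python
import numpy as np

def ddim_step(z_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a noisy latent z_t.

    alpha_bar_t and alpha_bar_prev are cumulative noise-schedule products
    at the current and the previous (less noisy) timestep.
    """
    # Estimate the clean latent implied by the predicted noise.
    z0_pred = (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Move the estimate to the previous noise level along the predicted noise direction.
    return np.sqrt(alpha_bar_prev) * z0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

# Hypothetical setup: three latent tokens (standing in for z_o, z_a, z_b) of width 8.
rng = np.random.default_rng(0)
z0 = rng.standard_normal((3, 8))
eps = rng.standard_normal((3, 8))
alpha_bar_t = 0.5

# Forward-noise the clean latent, then denoise with the true noise as the "prediction".
z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
z_prev = ddim_step(z_t, eps, alpha_bar_t, alpha_bar_prev=1.0)
```

With a perfect noise prediction and a final noise level of `alpha_bar_prev=1.0`, the step recovers the clean latent, which is a quick sanity check on the formula.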

Paper Structure (33 sections, 19 equations, 13 figures, 10 tables, 1 algorithm)

Figures (13)

  • Figure 1: (a) InterLDM and (b) InterMask encode all motion information into a single latent. (c) Ours encodes individual motions and interactions into separate, disentangled latents.
  • Figure 2: Architecture of our DHVAE, which encodes the structured latent representation $\mathbf{z}_o, \mathbf{z}_a, \mathbf{z}_b$. The global latent token $\mathbf{z}_o$ learns an interaction-plausible space via contrastive learning. The encoded structured representation is passed into a skip-connected AdaLN-Transformer that learns the denoising process.
  • Figure 3: Qualitative comparison with InterMask on the InterHuman dataset, indicating superior text alignment, fidelity, and physical plausibility. The body meshes are arranged sequentially from left to right, with colors progressing from light to dark.
  • Figure 4: User Study of DHVAE compared to InterMask and TIMotion
  • Figure 5: Metrics as the classifier-free guidance scale varies on the InterHuman dataset
  • ...and 8 more figures
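
Figure 5 refers to a classifier-free guidance scale. For readers unfamiliar with the term, the sketch below shows one common way the scale combines a text-conditioned and an unconditioned noise prediction. The function name `cfg_combine` and the array shapes are assumptions for illustration; the exact parameterization used in the paper is not stated in this excerpt, and some works use an equivalent form with the scale shifted by one.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Blend unconditional and conditional noise predictions.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates further toward the condition.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Hypothetical predictions for three latent tokens of width 8.
rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((3, 8))
eps_cond = rng.standard_normal((3, 8))

guided = cfg_combine(eps_uncond, eps_cond, scale=3.0)
```

Larger scales typically strengthen text alignment at some cost to diversity or fidelity, which is the kind of trade-off a sweep over the scale is meant to expose.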