LEAD: Latent Realignment for Human Motion Diffusion

Nefeli Andreou, Xi Wang, Victoria Fernández Abrevaya, Marie-Paule Cani, Yiorgos Chrysanthou, Vicky Kalogeiton

TL;DR

This work combines latent diffusion with a realignment mechanism to produce a semantically structured latent space that encodes the semantics of language. For motion textual inversion, the method captures out-of-distribution characteristics better than traditional VAEs.

Abstract

Our goal is to generate realistic human motion from natural language. Modern methods often face a trade-off between model expressiveness and text-to-motion alignment. Some align text and motion latent spaces but sacrifice expressiveness; others rely on diffusion models that produce impressive motions but lack semantic meaning in their latent space. This may compromise realism, diversity, and applicability. Here, we address this by combining latent diffusion with a realignment mechanism, producing a novel, semantically structured space that encodes the semantics of language. Leveraging this capability, we introduce the task of motion textual inversion to capture novel motion concepts from a few examples. For motion synthesis, we evaluate LEAD on HumanML3D and KIT-ML and show comparable performance to the state-of-the-art in terms of realism, diversity, and text-motion consistency. Our qualitative analysis and user study reveal that our synthesized motions are sharper, more human-like, and comply better with the text compared to modern methods. For motion textual inversion, our method demonstrates improved capacity in capturing out-of-distribution characteristics in comparison to traditional VAEs.

Paper Structure

This paper contains 19 sections, 14 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: LEAD. Left: Text-to-motion generation with LEAD. LEAD consists of four modules: a motion VAE (blue), a text encoder (green), a diffusion model (brown), and our new projector module (pink). As in latent diffusion models (LDM) (Rombach et al., 2022), we first train the VAE and then the diffusion model. We then train the projector module using an alignment loss towards the CLIP embedding and a reconstruction loss towards the VAE embedding. Once all modules are trained, we generate a motion latent $z^{vae}$ by sampling Gaussian noise and denoising it conditioned on the input text. The resulting latent is then auto-encoded by the projector and decoded through the VAE (blue) to obtain the final motion. Right: Motion textual inversion. A pseudo-word ($C_*$) is added as an additional token, and we seek the optimal embedding $v_*$ that best reproduces the input. Text conditioning guides the generation of motion through the diffusion module (brown). The embedding of the new token is learnt using the reconstruction objective on the realigned space.
  • Figure 2: Left: Qualitative results for T2M compared to the baseline model MLD (Chen et al., 2023). Motions generated with our approach are more expressive and less static (a, b), and contain fewer artifacts such as foot-sliding (c). Right: Latent space visualization using t-SNE (van der Maaten and Hinton, 2008).
  • Figure 3: Participant information for the user study.
  • Figure 4: User study results. We evaluate motion realism and text-motion relevance on a 1-5 scale where a higher score corresponds to better performance (1 = very unrealistic/unrelated, 5 = very realistic/related). Motions generated with LEAD are consistently perceived as more realistic and relevant than those generated using MLD (Chen et al., 2023) (subfigures (a) and (b)) and MotionCLIP (Tevet et al., 2022) (subfigure (c)).
  • Figure 5: Qualitative comparison between LEAD and MotionCLIP on 4 axes: diversity, global dynamics, realism and semantic alignment.
  • ...and 1 more figure
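
The Figure 1 caption describes the projector as trained with two terms: an alignment loss towards the CLIP embedding and a reconstruction loss towards the VAE embedding. The sketch below illustrates one way such a combined objective could be computed for a single sample. It is not the authors' implementation: the choice of cosine distance for alignment, mean squared error for reconstruction, the `align_weight` parameter, and all function names are assumptions made for illustration, since the summary above does not specify the loss forms.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def projector_loss(z_vae, z_aligned, z_recon, clip_emb, align_weight=1.0):
    """Hypothetical projector objective for one sample.

    z_vae     : latent from the (already trained) motion VAE
    z_aligned : projector output in the realigned space
    z_recon   : projector reconstruction of z_vae
    clip_emb  : CLIP text embedding of the paired caption
    The loss forms and align_weight are assumptions, not taken from the paper.
    """
    alignment = cosine_distance(z_aligned, clip_emb)  # pull towards CLIP
    reconstruction = mse(z_recon, z_vae)              # stay decodable by the VAE
    return reconstruction + align_weight * alignment


# A perfectly aligned, perfectly reconstructed sample gives zero loss.
perfect = projector_loss([1.0, 2.0], [1.0, 0.0], [1.0, 2.0], [2.0, 0.0])
# An orthogonal embedding and a poor reconstruction are both penalised.
poor = projector_loss([1.0, 1.0], [1.0, 0.0], [0.0, 0.0], [0.0, 1.0])
```

In a real training loop these terms would be computed over batches with an automatic-differentiation framework; the plain-Python version here only shows how the two terms named in the caption pull the projector in different directions.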