Length-Aware Motion Synthesis via Latent Diffusion

Alessio Sampieri, Alessio Palma, Indro Spinelli, Fabio Galasso

TL;DR

This work tackles length-aware, text-conditioned 3D human motion synthesis by introducing Length-Aware Latent Diffusion (LADiff), which jointly learns a length-aware VAE and a latent diffusion process that adapts to the target sequence duration. The latent space is organized into subspaces that activate progressively as the target length increases, and diffusion operates over the resulting variable-dimensional latent representation, aided by a Denoising VAE (DVAE) that makes decoding robust to noisy latents. Empirical results on HumanML3D and KIT-ML show state-of-the-art performance across most metrics, with ablations supporting the contribution of the length-aware subspaces, the DVAE augmentation, and the overall architecture. The approach enables control over motion duration while preserving textual fidelity and motion realism, with practical benefits for animation and robotics alike.

Abstract

The target duration of a synthesized human motion is a critical attribute that requires modeling control over the motion dynamics and style. Speeding up an action performance is not merely fast-forwarding it. However, state-of-the-art techniques for human behavior synthesis have limited control over the target sequence length. We introduce the problem of generating length-aware 3D human motion sequences from textual descriptors, and we propose a novel model to synthesize motions of variable target lengths, which we dub "Length-Aware Latent Diffusion" (LADiff). LADiff consists of two new modules: 1) a length-aware variational auto-encoder to learn motion representations with length-dependent latent codes; 2) a length-conforming latent diffusion model to generate motions with a richness of details that increases with the required target sequence length. LADiff significantly improves over the state-of-the-art across most of the existing motion synthesis metrics on the two established benchmarks of HumanML3D and KIT-ML.
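The idea of length-dependent latent codes can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform bucketing rule, the function names `active_subspaces` and `mask_latent`, and the defaults `max_len=196` and `num_subspaces=4` are all assumptions chosen for illustration. The sketch only shows how a target length might select how many latent subspaces are switched on, with the remaining ones zeroed out.

```python
def active_subspaces(target_len, max_len=196, num_subspaces=4):
    """Return how many latent subspaces are active for a target length.

    Hypothetical rule: lengths are split into equal-width buckets, and
    each successive bucket switches on one more subspace.
    """
    if not 1 <= target_len <= max_len:
        raise ValueError("target_len must be in [1, max_len]")
    # Integer ceiling of target_len * num_subspaces / max_len.
    return (target_len * num_subspaces + max_len - 1) // max_len


def mask_latent(latent, target_len, max_len=196, num_subspaces=4):
    """Zero out the subspaces that are inactive for the target length.

    `latent` is a list of `num_subspaces` chunks (lists of floats),
    one chunk per subspace. A new list is returned.
    """
    if len(latent) != num_subspaces:
        raise ValueError("latent must contain one chunk per subspace")
    k = active_subspaces(target_len, max_len, num_subspaces)
    return [
        list(chunk) if i < k else [0.0] * len(chunk)
        for i, chunk in enumerate(latent)
    ]


if __name__ == "__main__":
    # Four subspaces of two dimensions each (made-up values).
    z = [[0.3, -1.2], [0.8, 0.1], [-0.5, 0.9], [1.1, -0.4]]
    for length in (20, 60, 120, 196):
        print(length, active_subspaces(length), mask_latent(z, length))
```

Under these assumptions, a short target length keeps only the first chunk of the latent vector, while the maximum length keeps all of them, mirroring the description of codes whose richness grows with the requested duration.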

Paper Structure

This paper contains 19 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: A pictorial illustration of the proposed LADiff generative model. The green latent space, learned via a VAE, is subdivided into subspaces that activate progressively for longer target human motion sequences, i.e., the latent vectors of the shortest sequences lie on a 1D line, while those of longer sequences lie on planes, cubes, or higher-dimensional spaces. Correspondingly, during sequence generation via a latent DDPM, longer sequences learn attention patterns made up of more subspaces, i.e., more columns in the rectangles of latent-to-frame attention vectors. See Secs. \ref{sec:intro} and \ref{sec:discussion} for more details. As a result, shorter sequences depict faster actions, adopting the styles of motion that take the fewest frames. Longer sequences accommodate more frames with the longer version of the actions in terms of dynamics and style. We provide videos associated with the images in the paper in the additional materials.
  • Figure 2: Overview of our proposed Length-Aware Latent Diffusion (LADiff). During the Reconstruction stage (orange arrows), the Encoder, aided by the Decoder, learns to represent sequences of varying lengths in a latent space composed of subspaces, which activate progressively for longer sequences. In the Generation stage (blue arrows), the Denoiser learns to create latent vectors aligned to the textual input, which map to the subspaces specified by the input sequence length. The actual motion results from decoding the latent vectors of the Denoiser. For this purpose, the Decoder is made resilient to noise in the Reconstruction stage by learning to reconstruct sequences affected by noise.
  • Figure 3: Qualitative comparison of text-based human motion generations. Multiple target lengths are provided to the techniques that allow setting the length as an input. See Section \ref{sec:qualitative} for discussion.
  • Figure 4: (Rows) Generation of motions with the same textual input and varying queried target lengths, alongside (Top) the corresponding decoder transformer attention maps, where the $y$-axis represents the target motion length. Darker colors indicate higher attention scores for each subspace of the latent vector, represented in chunks along the $x$-axis. See Sec. \ref{sec:ablation} for the discussion.
  • Figure 5: The first row shows the decoder's attention map on the length-aware, denoised latent vectors. Then we depict the generated motion obtained using only the activated latent subspaces selected in red. See Sec. \ref{sec:discussion} for the detailed description.
  • ...and 1 more figure