
Efficient Text-driven Motion Generation via Latent Consistency Training

Mengxian Hu, Minghao Zhu, Xun Zhou, Qingqing Yan, Shu Li, Chengju Liu, Qijun Chen

TL;DR

This work tackles the inefficiency of text-driven motion diffusion by introducing Motion Latent Consistency Training (MLCT), which precomputes reverse diffusion trajectories during training in a bounded, quantized motion latent space. It combines three components: a motion autoencoder with tanh-bounded, finite latent states; a conditionally guided consistency training regime that extends classifier-free guidance (CFG) into training-time trajectory optimization; and a clustering guidance module that retrieves distribution references from a KNN-based dictionary. Together, these components enable stable latent consistency training and substantially reduce inference cost, achieving state-of-the-art or competitive results on KIT and HumanML3D with few function evaluations, including near single-step generation. The approach is relevant to real-time, text-controlled animation and robotics, and the framework may extend to other non-pixel modalities. Mathematical constructs such as $z_m=\mathcal{R}(l \cdot \tanh(\mathcal{E}(x)))/l$, $x_\epsilon^{\Phi}$, and $\mathcal{L}_c$ formalize the training dynamics and CFG-based guidance central to the method.
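The quantized latent $z_m=\mathcal{R}(l \cdot \tanh(\mathcal{E}(x)))/l$ can be illustrated with a minimal sketch. It assumes that $\mathcal{R}$ is rounding to the nearest integer and that $l$ is a positive integer controlling the number of levels; the function name `quantize_latent` and the sample values are hypothetical, and the encoder $\mathcal{E}$ is replaced by raw numbers.

```python
import numpy as np

def quantize_latent(h, l):
    """Bound encoder outputs with tanh, then snap them to a finite grid.

    h : raw encoder outputs E(x) (here just an array of numbers)
    l : assumed positive integer; the result takes at most 2*l + 1 values in [-1, 1]
    """
    return np.round(l * np.tanh(h)) / l

# Stand-in encoder outputs, including large magnitudes that tanh saturates.
h = np.array([-5.0, -0.3, 0.0, 0.3, 5.0])
z_m = quantize_latent(h, l=4)
```

Under these assumptions every latent entry lies in $[-1, 1]$ and is a multiple of $1/l$, which is the "bounded, finite states" property the summary refers to.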

Abstract

Text-driven human motion generation based on diffusion strategies establishes a reliable foundation for multimodal applications in human-computer interaction. However, existing advances face significant efficiency challenges due to the substantial computational overhead of iteratively solving for nonlinear reverse diffusion trajectories during the inference phase. To this end, we propose the motion latent consistency training framework (MLCT), which precomputes reverse diffusion trajectories from raw data in the training phase and enables few-step or single-step inference via self-consistency constraints in the inference phase. Specifically, a motion autoencoder with quantization constraints is first proposed for constructing concise and bounded solution distributions for motion diffusion processes. Subsequently, a classifier-free guidance format is constructed via an additional unconditional loss function to accomplish the precomputation of conditional diffusion trajectories in the training phase. Finally, a clustering guidance module based on the K-nearest-neighbor algorithm is developed for the chain-conduction optimization mechanism of self-consistency constraints, which provides additional references of solution distributions at a small query cost. By combining these enhancements, we achieve stable consistency training in non-pixel modalities and latent representation spaces. Benchmark experiments demonstrate that our method significantly outperforms traditional consistency distillation methods with reduced training cost and enables the consistency model to perform comparably to state-of-the-art models at lower inference cost.
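To make "few-step or single-step inference via self-consistency constraints" concrete, the sketch below follows the generic multistep sampling procedure of consistency models, not the paper's exact sampler. It is a toy under stated assumptions: the data are a 1-D Gaussian, the noising process is $x_t = x_0 + t \cdot \text{noise}$, and the trained network is replaced by the closed-form consistency function for that setting. The constants `EPS`, `T`, `MU`, `S` and both function names are hypothetical.

```python
import numpy as np

EPS, T = 0.002, 80.0   # assumed smallest and largest noise levels
MU, S = 1.5, 0.5       # toy 1-D Gaussian "data" distribution N(MU, S^2)

def consistency_fn(x, t):
    """Closed-form consistency function for Gaussian data under x_t = x_0 + t * noise.

    Maps a point at noise level t to the start (level EPS) of its
    probability-flow trajectory, so consistency_fn(x, EPS) returns x.
    """
    return MU + (x - MU) * np.sqrt(S**2 + EPS**2) / np.sqrt(S**2 + t**2)

def consistency_sample(f, n, timesteps, rng):
    """One call to f from pure noise, plus one call per optional refinement step."""
    x = f(T * rng.standard_normal(n), T)           # single-step generation
    for t in timesteps:                            # decreasing levels in (EPS, T)
        x_t = x + np.sqrt(t**2 - EPS**2) * rng.standard_normal(n)  # re-noise
        x = f(x_t, t)                              # jump back to the data end
    return x

rng = np.random.default_rng(0)
one_step = consistency_sample(consistency_fn, 20000, [], rng)
three_step = consistency_sample(consistency_fn, 20000, [20.0, 5.0], rng)
```

Each sample costs one function evaluation in the single-step case and one more per entry in `timesteps`, which is the inference-cost trade-off the abstract describes.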

Paper Structure (28 sections, 18 equations, 6 figures, 8 tables)


Figures (6)

  • Figure 1: Overview of the distinctions between our method and traditional methods. (a) Traditional diffusion methods calculate diffusion trajectories using the well-trained diffusion model $f_\theta^*$ during inference, incurring high sampling iteration costs. (b) Consistency distillation precomputes the diffusion trajectories in training via the teacher model $f_\theta^*$ and uses skip-step sampling in inference via self-consistency constraints. (c) Consistency training eliminates reliance on the teacher model $f_\theta^*$ and estimates diffusion trajectories directly from raw data $x_\epsilon$. (d) Our approach extends consistency training to the motion latent space by refining motion representations into bounded and concise distributions, integrating conditional guidance to optimize diffusion trajectories from raw data $x_\epsilon$ via an online-trained unconditional model $f_{\theta,\emptyset}$, and introducing a clustering guidance module to supply solution distribution references for a given instruction.
  • Figure 2: Approach overview. (a) Motion sequences are encoded with quantization constraints, ensuring bounded finite states that are structurally analogous to pixel representations. (b) Conditional diffusion trajectories are constructed during the training phase using an online simulation of the CFG format. (c) A clustering guidance module is integrated into the consistency model $\mathcal{S}_\psi$. This module constructs a clustering dictionary using the KNN algorithm and leverages an attention-like query mechanism to provide solution distribution references tailored to the given textual conditions.
  • Figure 3: Comparison with latent consistency distillation frameworks, including the latest proposed MotionLCM and ablation experiments of the proposed method in distillation mode.
  • Figure 4: Qualitative analysis of our model and previous models. Our model demonstrates improved motion generation performance, matching textual conditions with lower inference costs. The color of humans darkens over time.
  • Figure 5: Motion visualizations are sampled from the first three relevant clustering categories, with the # symbol indicating the unique ID of the sample in the training set.
  • ...and 1 more figure
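The clustering guidance idea in Figure 2(c), a KNN dictionary read through an attention-like query, can be pictured with a small sketch. This is an illustration under assumptions, not the paper's module: it assumes the dictionary stores one text embedding (key) and one motion latent (value) per training sample, uses cosine similarity for the neighbor search, and blends the retrieved values with softmax weights. The function name `knn_reference` and all array shapes are hypothetical.

```python
import numpy as np

def knn_reference(query, keys, values, k=3):
    """Return a softmax-weighted blend of the values of the k nearest keys.

    query  : (d,) embedding of the text condition
    keys   : (n, d) stored text embeddings (the dictionary keys)
    values : (n, m) stored motion latents paired with the keys
    """
    q = query / np.linalg.norm(query)
    kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = kn @ q                          # cosine similarity to every key
    idx = np.argsort(-sims)[:k]            # indices of the k most similar keys
    w = np.exp(sims[idx] - sims[idx].max())
    w /= w.sum()                           # attention-like weights, sum to 1
    return w @ values[idx], idx, w

rng = np.random.default_rng(0)
keys = rng.standard_normal((10, 8))        # toy dictionary of 10 entries
values = rng.standard_normal((10, 4))
query = keys[2] + 0.01 * rng.standard_normal(8)   # a query close to entry 2
reference, idx, w = knn_reference(query, keys, values, k=3)
```

The lookup touches only `k` entries per query, which matches the caption's description of supplying references tailored to the text condition; how the reference is then consumed by the consistency model $\mathcal{S}_\psi$ is not specified here.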