Efficient Text-driven Motion Generation via Latent Consistency Training
Mengxian Hu, Minghao Zhu, Xun Zhou, Qingqing Yan, Shu Li, Chengju Liu, Qijun Chen
TL;DR
This work tackles the inefficiency of text-driven motion diffusion by introducing Motion Latent Consistency Training (MLCT), which precomputes reverse diffusion trajectories during training in a bounded, quantized motion latent space. It combines three components: a motion autoencoder with tanh-bounded, finite latent states; a conditionally guided consistency training regime that brings classifier-free guidance (CFG) into training-time trajectory optimization; and a clustering guidance module that retrieves references to the solution distribution from a KNN-based dictionary. Together, these components enable stable latent consistency training and reduce inference cost, achieving state-of-the-art or competitive results on KIT and HumanML3D with very few function evaluations, including near single-step generation. The approach is relevant to real-time, text-controlled animation and robotics, and the framework could potentially extend to other non-pixel modalities. Constructs such as $z_m=\mathcal{R}(l \cdot \tanh(\mathcal{E}(x)))/l$, $x_\epsilon^{\Phi}$, and $\mathcal{L}_c$ formalize the training dynamics and the CFG-based guidance central to the method.
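As a rough illustration of the bounded, quantized latent $z_m=\mathcal{R}(l \cdot \tanh(\mathcal{E}(x)))/l$, the sketch below assumes that $\mathcal{R}$ is element-wise rounding and that $l$ is a positive integer setting the grid resolution; the summary does not state either explicitly. The array `encoded` stands in for the encoder output $\mathcal{E}(x)$, and the function name `quantize_latent` is hypothetical.

```python
import numpy as np

def quantize_latent(encoded, l):
    """Bound values with tanh, then snap them to a grid of spacing 1/l.

    Assumption: R(.) in the formula is read as element-wise rounding.
    """
    bounded = np.tanh(encoded)        # every value now lies in (-1, 1)
    return np.round(l * bounded) / l  # values on the grid {-1, ..., -1/l, 0, 1/l, ..., 1}

# Stand-in for an encoder output E(x); the numbers are made up for illustration.
encoded = np.array([-3.0, -0.2, 0.0, 0.4, 2.5])
z_m = quantize_latent(encoded, l=4)
print(z_m)
```

Under these assumptions each latent dimension takes at most $2l+1$ distinct values in $[-1, 1]$, which is one way to read the "finite latent states" mentioned above.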
Abstract
Text-driven human motion generation based on diffusion strategies establishes a reliable foundation for multimodal applications in human-computer interaction. However, existing approaches face significant efficiency challenges due to the substantial computational overhead of iteratively solving nonlinear reverse diffusion trajectories during the inference phase. To this end, we propose the motion latent consistency training framework (MLCT), which precomputes reverse diffusion trajectories from raw data in the training phase and enables few-step or single-step inference via self-consistency constraints in the inference phase. Specifically, a motion autoencoder with quantization constraints is first proposed to construct concise and bounded solution distributions for motion diffusion processes. Subsequently, a classifier-free guidance format is constructed via an additional unconditional loss function to precompute conditional diffusion trajectories in the training phase. Finally, a clustering guidance module based on the K-nearest-neighbor algorithm is developed for the chain-conduction optimization mechanism of self-consistency constraints, which provides additional references of solution distributions at a small query cost. By combining these enhancements, we achieve stable consistency training in non-pixel modalities and latent representation spaces. Benchmark experiments demonstrate that our method significantly outperforms traditional consistency distillation methods at a reduced training cost, and enables the consistency model to perform comparably to state-of-the-art models at lower inference cost.
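To make the idea of a K-nearest-neighbor dictionary query concrete, here is a toy sketch. It is not the authors' clustering guidance module: the choice of keys, stored values, Euclidean distance, and mean aggregation are all assumptions made for illustration, and `knn_lookup` is a hypothetical helper.

```python
import numpy as np

def knn_lookup(query, keys, values, k):
    """Return the mean of the values stored under the k keys nearest to `query`.

    Assumptions: Euclidean distance and mean aggregation are illustrative choices.
    """
    dists = np.linalg.norm(keys - query, axis=1)  # distance from query to each key
    nearest = np.argsort(dists)[:k]               # indices of the k closest keys
    return values[nearest].mean(axis=0), nearest

# Made-up dictionary: four 2-D keys, each paired with a 1-D stored reference.
keys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
values = np.array([[0.0], [2.0], [4.0], [100.0]])

reference, idx = knn_lookup(np.array([0.1, 0.1]), keys, values, k=3)
print(reference, sorted(idx.tolist()))
```

A brute-force scan like this costs one distance computation per dictionary entry, so the query stays cheap as long as the dictionary is small relative to the cost of a network evaluation.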
