
MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model

Wenxun Dai, Ling-Hao Chen, Jingbo Wang, Jinpeng Liu, Bo Dai, Yansong Tang

TL;DR

MotionLCM addresses the bottleneck of real-time controllable text-to-motion generation by distilling a motion latent diffusion model into a latent consistency model. It introduces a Motion ControlNet in the latent space and leverages explicit supervision from decoded motion space to train controllable capabilities without sacrificing speed, achieving real-time inference (~30 ms per sequence). Experiments on HumanML3D show MotionLCM provides strong generation quality and robust controllability, outperforming state-of-the-art diffusion-based baselines in speed and, in many cases, in accuracy of alignment with text and control signals. This work enables practical, interactive applications of controllable human motion synthesis.

Abstract

This work introduces MotionLCM, extending controllable motion generation to a real-time level. Existing methods for spatial-temporal control in text-conditioned motion generation suffer from significant runtime inefficiency. To address this issue, we first propose the motion latent consistency model (MotionLCM) for motion generation, building on the motion latent diffusion model. By adopting one-step (or few-step) inference, we further improve the runtime efficiency of the motion latent diffusion model for motion generation. To ensure effective controllability, we incorporate a motion ControlNet within the latent space of MotionLCM and enable explicit control signals (i.e., initial motions) in the vanilla motion space to further provide supervision for the training process. By employing these techniques, our approach can generate human motions with text and control signals in real-time. Experimental results demonstrate the remarkable generation and controlling capabilities of MotionLCM while maintaining real-time runtime efficiency.
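
As a rough illustration of one-step versus few-step inference, the following is a minimal sketch of multistep consistency sampling in pure Python. It is not the authors' implementation: the cosine noise schedule, the timestep spacing, and the names `noise_levels`, `consistency_sample`, and `toy_f` are assumptions made for this example, and `toy_f` only stands in for a trained consistency function that maps a noisy latent to a clean one.

```python
import math
import random


def noise_levels(t, T):
    """Assumed cosine schedule: returns (alpha_t, sigma_t)."""
    angle = 0.5 * math.pi * t / T
    return math.cos(angle), math.sin(angle)


def consistency_sample(f, dim, steps, T=1000, seed=0):
    """Sample a clean latent with `steps` calls to a consistency function f(z, t)."""
    rng = random.Random(seed)
    # Evenly spaced timesteps from high noise to low noise (assumed spacing).
    timesteps = [(T - 1) - i * (T // steps) for i in range(steps)]

    # Step 1: map pure Gaussian noise directly to a clean-latent estimate.
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    z0 = f(z, timesteps[0])

    # Optional extra steps: re-noise the estimate to a lower level, then map again.
    for t in timesteps[1:]:
        alpha, sigma = noise_levels(t, T)
        z = [alpha * v + sigma * rng.gauss(0.0, 1.0) for v in z0]
        z0 = f(z, t)
    return z0


def toy_f(z, t):
    """Hypothetical stand-in for a trained model: always predicts the same latent."""
    return [0.5 for _ in z]


one_step = consistency_sample(toy_f, dim=4, steps=1)
four_step = consistency_sample(toy_f, dim=4, steps=4)
```

With `steps=1` the function is called once on pure noise. Each additional step re-noises the current estimate to a lower noise level and maps it back again, which trades some speed for the chance to refine the result.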

Paper Structure (17 sections, 12 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: We propose MotionLCM, a real-time controllable motion latent consistency model. Our model uses the last few frames of the previous motion as temporal control signals to autoregressively generate the next motion in real-time under different text prompts. Green blocks denote the junctions. The numbers in red are the inference time.
  • Figure 2: Comparison of the inference time costs on HumanML3D [humanml3d]. We compare the AITS (average inference time per sentence) and FID metrics with five SOTA methods. The closer the model is to the origin, the better. Diffusion-based models are indicated by the blue dashed box. Our MotionLCM achieves real-time inference speed while ensuring high-quality motion generation.
  • Figure 3: The training objective of consistency distillation is to learn a consistency function $\textbf{f}_{\mathbf{\Theta}}$, initialized with the parameters of a pre-trained diffusion model (e.g., MLD [mld]). This function $\textbf{f}_{\mathbf{\Theta}}$ should project any point (i.e., $\mathbf{z}_t$) on the ODE trajectory to the trajectory's solution (i.e., $\mathbf{z}_0$). Once the pre-trained model [mld] is distilled, unlike traditional denoising models [motiondiffuse, mdm] that require many sampling steps, our MotionLCM can generate high-quality motion sequences with one-step sampling and further improve the generation quality through multi-step inference.
  • Figure 4: The overview of MotionLCM. (a) Motion Latent Consistency Distillation (subsection "MotionLCM: Motion Latent Consistency Model"). Given a raw motion sequence $\mathbf{x}^{1:N}_0$, a pre-trained VAE [vae] encoder first compresses it into the latent space, then a forward diffusion operation is performed to add $n+k$ steps of noise. Then, the noisy $\mathbf{z}_{n+k}$ is fed into the online network and teacher network to predict the clean latent. The target network takes the $k$-step estimation results of the teacher output to predict the clean latent. To learn self-consistency, a loss is applied to enforce the output of the online network and target network to be consistent. (b) Motion Control in Latent Space (subsection "Controllable Motion Generation in Latent Space"). With the powerful MotionLCM trained in the first stage, we incorporate a motion ControlNet into the MotionLCM to achieve controllable motion generation. Furthermore, we leverage the decoded motion to explicitly supervise the spatial-temporal control signals (i.e., initial poses $\mathbf{g}^{1:\tau}$).
  • Figure 5: Qualitative comparison of the state-of-the-art methods in the text-to-motion task. With only one-step inference, MotionLCM achieves the fastest motion generation while producing high-quality movements that closely match the textual descriptions.
  • ...and 4 more figures
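
The distillation step described in the captions of Figures 3 and 4(a) can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the paper's code: latents are plain lists of floats, the cosine noise schedule, the DDIM solver used for the teacher's $k$-step skip, the Huber loss, and the EMA update of the target network are all choices made for this example, and `online`, `target`, and `teacher` are hypothetical callables. The teacher is assumed to predict noise, and classifier-free guidance is omitted.

```python
import math
import random


def noise_levels(t, T):
    """Assumed cosine schedule: returns (alpha_t, sigma_t) with alpha^2 + sigma^2 = 1."""
    angle = 0.5 * math.pi * t / T
    return math.cos(angle), math.sin(angle)


def add_noise(z0, t, T, rng):
    """Forward diffusion: z_t = alpha_t * z_0 + sigma_t * eps."""
    alpha, sigma = noise_levels(t, T)
    return [alpha * z + sigma * rng.gauss(0.0, 1.0) for z in z0]


def ddim_step(z_t, eps_hat, t, s, T):
    """Deterministic DDIM move from timestep t to an earlier timestep s (needs t < T)."""
    a_t, s_t = noise_levels(t, T)
    a_s, s_s = noise_levels(s, T)
    z0_hat = [(z - s_t * e) / a_t for z, e in zip(z_t, eps_hat)]
    return [a_s * z0 + s_s * e for z0, e in zip(z0_hat, eps_hat)]


def huber(pred, ref, delta=1.0):
    """Mean Huber distance between two latents (assumed choice of loss)."""
    total = 0.0
    for p, q in zip(pred, ref):
        d = abs(p - q)
        total += 0.5 * d * d if d <= delta else delta * (d - 0.5 * delta)
    return total / len(pred)


def consistency_distillation_loss(z0, n, k, online, target, teacher, T, rng):
    """One self-consistency loss evaluation for a clean latent z0."""
    z_nk = add_noise(z0, n + k, T, rng)            # add n+k steps of noise
    pred_online = online(z_nk, n + k)              # online network: z_{n+k} -> clean latent
    eps_hat = teacher(z_nk, n + k)                 # frozen teacher predicts the noise
    z_n = ddim_step(z_nk, eps_hat, n + k, n, T)    # teacher's k-step skip along the ODE
    pred_target = target(z_n, n)                   # target network: z_n -> clean latent
    return huber(pred_online, pred_target)         # enforce consistent outputs


def ema_update(target_params, online_params, mu=0.95):
    """Assumed EMA rule that lets the target network slowly track the online one."""
    return [mu * p_t + (1.0 - mu) * p_o
            for p_t, p_o in zip(target_params, online_params)]
```

In a real training loop the loss would be backpropagated through the online network only, while the target network is updated without gradients. A function that already maps every point on a trajectory to the same clean latent yields zero loss, which is the self-consistency property the distillation aims for.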