MotionPCM: Real-Time Motion Synthesis with Phased Consistency Model

Lei Jiang, Ye Wei, Hao Ni

TL;DR

MotionPCM is a phased consistency model-based approach that improves the quality and efficiency of real-time motion synthesis in latent space. It achieves real-time inference at over 30 frames per second with a single sampling step while outperforming the previous state of the art with a 38.9% improvement in FID.

Abstract

Diffusion models have become a popular choice for human motion synthesis due to their powerful generative capabilities. However, their high computational complexity and large number of sampling steps pose challenges for real-time applications. Fortunately, the Consistency Model (CM) provides a way to reduce the number of sampling steps from hundreds to a few, typically fewer than four, significantly accelerating the synthesis of diffusion models. However, applying CM to text-conditioned human motion synthesis in latent space yields unsatisfactory generation results. In this paper, we introduce MotionPCM, a phased consistency model-based approach designed to improve the quality and efficiency of real-time motion synthesis in latent space. Experimental results on the HumanML3D dataset show that our model achieves real-time inference at over 30 frames per second with a single sampling step while outperforming the previous state of the art with a 38.9% improvement in FID. The code will be available for reproduction.

Paper Structure

This paper contains 24 sections, 1 theorem, 25 equations, 12 figures, 4 tables.

Key Result

Lemma 1

Let $F_\theta(x,t,s)$ be defined in Eq. F_theta, i.e., Then $F_\theta$ allows the equivalent representation:

Figures (12)

  • Figure 1: We propose a new text-conditioned motion synthesis model: MotionPCM, capable of real-time motion generation with improved performance. Lighter colours represent earlier time points.
  • Figure 2: Differences between Consistency/Latent Consistency Models and Phased Consistency Models in multi-step sampling.
  • Figure 3: Comparison of other motion synthesis methods with our method. AITS represents the time required to generate a motion sequence from a textual description. To facilitate display, the $x$-axis is plotted on a logarithmic scale.
  • Figure 4: The pipeline of our proposed MotionPCM. In the training phase, a pre-trained VAE encodes the motion sequence into a latent code $z_0$, which undergoes $n+k$ diffusion steps to produce $z_{t_{n+k}}$. $z_{t_{n+k}}$ is denoised to $\hat{z}_{t_n}$ by a teacher network and an ODE solver, and $\hat{z}_{t_n}$ is passed through a target network to predict $\hat{z}_{s_m}$. Simultaneously, the online network denoises $z_{t_{n+k}}$ directly to $\tilde{z}_{s_m}$. A consistency loss within the time interval $[s_m,s_{m+1}]$ compares $\tilde{z}_{s_m}$ with $\hat{z}_{s_m}$. Additionally, adversarial training adds different noises to $\tilde{z}_{s_m}$ and $z_0$, generating $\tilde{z}_s$ and $z_s$ respectively, which are compared by a discriminator to enforce realism and improve model performance. The trainable components are the online network and the discriminator; the encoder and teacher network remain frozen during training. The target network is updated using an exponential moving average.
  • Figure 5: Detailed structure of discriminator in our proposed MotionPCM model.
  • ...and 7 more figures
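
The consistency branch described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `add_noise`, `teacher_solver`, `online`, and `target` are hypothetical stand-ins for the forward diffusion, the teacher network with its ODE solver, and the two student networks; the squared-error distance, the tie-break at interval edges, the assumption that $t_{n+k}$ and $t_n$ fall in the same interval, and the EMA decay value are all choices made here for illustration. The adversarial branch with the discriminator is omitted.

```python
import numpy as np

def phase_start(t, edges):
    """Left edge s_m of the interval [s_m, s_{m+1}] that contains t.

    `edges` is an ascending sequence s_0 < s_1 < ... < s_M (hypothetical values).
    A timestep equal to an interior edge is assigned to the interval that
    starts there; this tie-break is an assumption of this sketch.
    """
    edges = np.asarray(edges, dtype=float)
    m = np.searchsorted(edges, t, side="right") - 1
    m = int(np.clip(m, 0, len(edges) - 2))  # keep the last edge in the last interval
    return float(edges[m])

def consistency_loss(z0, t_nk, t_n, edges, add_noise, teacher_solver, online, target):
    """Consistency term for one sample, following the data flow of Figure 4.

    Assumes t_nk and t_n lie in the same interval [s_m, s_{m+1}].
    """
    s_m = phase_start(t_n, edges)
    z_tnk = add_noise(z0, t_nk)                  # z_0 -> z_{t_{n+k}}
    z_hat_tn = teacher_solver(z_tnk, t_nk, t_n)  # teacher + ODE solver -> \hat{z}_{t_n}
    z_hat_sm = target(z_hat_tn, t_n, s_m)        # target network -> \hat{z}_{s_m}
    z_tilde_sm = online(z_tnk, t_nk, s_m)        # online network -> \tilde{z}_{s_m}
    # Squared error is an assumed distance; the paper may use a different one.
    return float(np.mean((z_tilde_sm - z_hat_sm) ** 2))

def ema_update(target_params, online_params, decay=0.95):
    """Exponential moving average of online parameters into target parameters.

    The decay value is a placeholder, not taken from the paper.
    """
    return {
        name: decay * target_params[name] + (1.0 - decay) * online_params[name]
        for name in target_params
    }
```

In a real training loop the loss would be backpropagated through the online network only, with `ema_update` applied to the target network after each optimizer step.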

Theorems & Definitions (2)

  • Lemma 1
  • proof