MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training

Kengo Uchida, Takashi Shibuya, Yuhta Takida, Naoki Murata, Julian Tanke, Shusuke Takahashi, Yuki Mitsufuji

TL;DR

MoLA is a text-to-motion framework that combines fast, high-quality, variable-length generation with versatile editing. It trains a motion VAE adversarially (a VAE-GAN with a SAN-based discriminator) to learn a compact motion latent space, and uses a text-conditioned latent diffusion model to generate motions from descriptions. Training-free guided generation via MPGD supports path following, in-betweening, and upper-body editing, while an activation variable allows the output length to match the textual input. Together, these components yield state-of-the-art performance among continuous-latent methods on the HumanML3D and KIT-ML benchmarks, with substantial gains in speed and editability for animation applications.

Abstract

In text-to-motion generation, controllability has become as critical as generation quality and speed. The controllability challenges include generating a motion whose length matches the given textual description and editing generated motions according to control signals, such as start and end positions or the pelvis trajectory. In this paper, we propose MoLA, which provides fast, high-quality, variable-length motion generation and handles multiple editing tasks in a single framework. Our approach revisits the motion representation used as the model's inputs and outputs, incorporating an activation variable to enable variable-length motion generation. Additionally, we integrate a variational autoencoder and a latent diffusion model, further enhanced through adversarial training, to achieve high-quality and fast generation. Moreover, we apply a training-free guided generation framework to perform various editing tasks with motion control inputs. We quantitatively show the effectiveness of adversarial learning in text-to-motion generation, and demonstrate the applicability of our editing framework to multiple editing tasks in the motion domain.
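
The abstract does not say how the activation variable is encoded or read out, so the following is only a sketch of one plausible reading. It assumes the model emits a per-frame scalar in [0, 1] alongside the motion features, and that the motion ends at the first frame whose activation falls below a threshold. The function name `trim_by_activation`, the array shapes, and the 0.5 threshold are all hypothetical.

```python
import numpy as np

def trim_by_activation(motion, activation, threshold=0.5):
    """Cut a fixed-size motion array down to its active prefix.

    motion:     array of shape (T, D), one feature vector per frame.
    activation: array of shape (T,), assumed to lie in [0, 1].
    Returns the trimmed motion and its length in frames.
    Assumption (not from the paper): the motion ends at the first
    frame whose activation is below `threshold`.
    """
    active = np.asarray(activation) >= threshold
    inactive_idx = np.flatnonzero(~active)
    # If every frame is active, keep the full sequence.
    length = int(inactive_idx[0]) if inactive_idx.size else len(active)
    return motion[:length], length

# Usage with made-up values: frames 0..2 are active, frame 3 ends the motion.
motion = np.zeros((6, 3))
activation = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.6])
trimmed, length = trim_by_activation(motion, activation)
```

Cutting at the first inactive frame, rather than counting all active frames, ignores a stray high activation late in the sequence; whether MoLA does this is not stated here.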

Paper Structure

This paper contains 28 sections, 17 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: MoLA achieves fast and high-quality human motion generation given textual descriptions while enabling motion editing applications. With MoLA, we can deal with various types of motion editing tasks in a single framework.
  • Figure 2: Comparison of inference cost, generation performance, and editability for text-to-motion methods on the HumanML3D dataset (all tests were performed on an NVIDIA A100 GPU). $\bullet$ marks a method that can edit motion in a training-free manner, and $\times$ marks a method that cannot. The pink arrow indicates that MoLA significantly extends the performance boundary (in terms of generation quality and speed) of methods that enable training-free editing.
  • Figure 3: The overall framework of MoLA. Stage 1: A motion VAE enhanced by adversarial training learns a low-dimensional latent representation of diverse motion sequences. Stage 2: A text-conditioned latent diffusion model leverages this representation for fast and high-quality text-to-motion generation. Guided generation: During inference, a gradient-based method minimizes a loss function $\mathfrak{L}_{\text{Motion}}$ for each desired editing task, enabling multiple motion editing tasks within a unified framework.
  • Figure 4: Comparison of motion length distributions between the HumanML3D test set and the generated samples. The Jensen-Shannon divergence (JSD) between the ground truth (GT) and each method is $\text{JSD}(\text{GT}||\text{T2M-GPT})=0.041$, $\text{JSD}(\text{GT}||\text{MoMASK})=0.040$, and $\text{JSD}(\text{GT}||\text{MoLA})=0.026$. Similarly, the Earth Mover’s Distance (EMD) is $\mathcal{D}_{\text{EMD}}(\text{GT}, \text{T2M-GPT})=6.706$, $\mathcal{D}_{\text{EMD}}(\text{GT}, \text{MoMASK})=3.673$, and $\mathcal{D}_{\text{EMD}}(\text{GT}, \text{MoLA})=3.538$ (one EMD unit corresponds to one frame).
  • Figure 5: Qualitative results for the three editing tasks (path following, in-betweening, and upper-body editing). For these tasks, we treat each of the control signals (i) and (ii) on the left side of the figure as $\bm{y}$ in the MPGD update equation and the editing-loss equation. The corresponding generated results using the same input text are shown on the right side of the figure as (i) and (ii), respectively.
  • ...and 2 more figures
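
The two statistics in the Figure 4 caption can be computed from lists of motion lengths as in the sketch below. This is an illustration, not the authors' evaluation code. The logarithm base for the JSD and the histogram binning are not given in the caption, so natural log and one bin per frame are assumptions. The helper names and the toy inputs are hypothetical.

```python
import numpy as np

def length_histogram(lengths, max_len):
    """Normalized histogram over integer lengths 0..max_len (in frames).

    Assumes every length is an integer in [0, max_len].
    """
    counts = np.bincount(np.asarray(lengths, dtype=int), minlength=max_len + 1)
    return counts / counts.sum()

def kl(p, q):
    """KL divergence KL(p || q) in nats; terms with p == 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence (natural log, so bounded by ln 2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def emd_1d(p, q):
    """Earth Mover's Distance between 1D histograms on unit-spaced bins.

    With one bin per frame, the result is measured in frames.
    """
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))))

# Toy check: all mass at 10 frames versus all mass at 13 frames.
p = length_histogram([10], max_len=20)
q = length_histogram([13], max_len=20)
print(emd_1d(p, q))  # 3.0: the mass moves by three frames
print(jsd(p, q))     # ln 2, the maximum for disjoint distributions
```

The two measures complement each other: JSD compares bin-wise mass and ignores how far apart mismatched bins are, whereas EMD grows with the number of frames by which the mass must be shifted.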