MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training
Kengo Uchida, Takashi Shibuya, Yuhta Takida, Naoki Murata, Julian Tanke, Shusuke Takahashi, Yuki Mitsufuji
TL;DR
MoLA is a text-to-motion framework that provides fast, high-quality, variable-length generation and versatile editing in a single model. It trains a variational autoencoder adversarially, using a SAN-based discriminator, to learn a compact motion latent space, and then trains a text-conditioned latent diffusion model in that space to generate motions from descriptions. Training-free guided generation via MPGD supports path-following, in-betweening, and upper-body editing without retraining, while an activation variable in the motion representation lets the output length follow the textual input. Together, these components yield state-of-the-art performance among continuous-latent methods on benchmarks such as HumanML3D and KIT-ML, along with faster generation and support for multiple editing tasks.
Abstract
In text-to-motion generation, controllability, alongside generation quality and speed, has become increasingly critical. The controllability challenges include generating a motion whose length matches the given textual description and editing generated motions according to control signals, such as start and end positions or the pelvis trajectory. In this paper, we propose MoLA, which provides fast, high-quality, variable-length motion generation and handles multiple editing tasks in a single framework. Our approach revisits the motion representation used as the model's input and output, incorporating an activation variable to enable variable-length motion generation. Additionally, we integrate a variational autoencoder and a latent diffusion model, further enhanced through adversarial training, to achieve high-quality and fast generation. Moreover, we apply a training-free guided generation framework to perform various editing tasks with motion control inputs. We quantitatively show the effectiveness of adversarial learning in text-to-motion generation and demonstrate the applicability of our editing framework to multiple editing tasks in the motion domain.
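
To make the idea of an activation variable concrete, the following is a minimal sketch and not the authors' implementation. It assumes the activation is one extra per-frame channel (1 for valid frames, 0 for padding) appended to a fixed-length representation, and that the output length is recovered by thresholding that channel. The function names, the zero-padding scheme, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from typing import List

Frame = List[float]


def add_activation(frames: List[Frame], max_len: int) -> List[Frame]:
    """Pad a motion to max_len frames and append an activation channel.

    Assumption: activation is 1.0 on real frames and 0.0 on padding.
    """
    if not frames:
        raise ValueError("motion must contain at least one frame")
    if len(frames) > max_len:
        raise ValueError("motion is longer than max_len")
    dim = len(frames[0])
    # Real frames keep their features and get activation 1.0.
    out = [list(f) + [1.0] for f in frames]
    # Padding frames are all zeros, including the activation channel.
    out += [[0.0] * (dim + 1) for _ in range(max_len - len(frames))]
    return out


def trim_by_activation(frames: List[Frame], threshold: float = 0.5) -> List[Frame]:
    """Keep the leading run of frames whose activation exceeds threshold.

    The activation channel (last element) is dropped from the result.
    """
    length = 0
    for f in frames:
        if f[-1] <= threshold:
            break
        length += 1
    return [f[:-1] for f in frames[:length]]


if __name__ == "__main__":
    motion = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    padded = add_activation(motion, max_len=5)
    recovered = trim_by_activation(padded)
    print(len(padded), len(recovered))
```

Under these assumptions, a generator that always emits a fixed number of frames can still produce motions of different lengths, because the length is read off the predicted activation channel rather than supplied as a separate input.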
