Strong and Controllable 3D Motion Generation
Canxuan Gang
TL;DR
This work tackles real-time, text-driven 3D motion generation by addressing two core issues: inference efficiency and fine-grained joint control. It introduces two primary innovations: an Efficient Motion Transformer that uses flash linear attention to achieve linear computational complexity within the latent diffusion framework, and Motion ControlNet combined with latent consistency distillation to provide precise joint-level control in the motion latent space. Together, these components aim to accelerate motion generation while enabling controllable, high-fidelity motions suitable for real-world applications in gaming, robotics, and AR/VR. The authors outline a detailed plan encompassing literature review, implementation, comparative and ablation studies on HumanML3D, and manuscript preparation for top venues, underlining the practical impact and potential for real-time, controllable motion synthesis.
Abstract
Human motion generation is an important problem in generative computer vision, with applications in film-making, video games, AR/VR, and human-robot interaction. Current text-to-motion methods mainly rely on either diffusion-based generative models or autoregressive models. However, they face two significant challenges: (1) Generation is slow, which is a major obstacle for real-time applications such as gaming, robot manipulation, and other online settings. (2) These methods typically learn a relative motion representation guided by text, which makes it difficult to generate motion sequences with precise joint-level control. These challenges limit the real-world application of human motion generation techniques. To address them, we propose a simple yet effective architecture consisting of two key components. First, we aim to improve the hardware efficiency and reduce the computational complexity of transformer-based diffusion models for human motion generation. By customizing flash linear attention for these models, we can make attention cost grow linearly rather than quadratically with sequence length. We will also customize the consistency model in the motion latent space to further accelerate generation. Second, we introduce Motion ControlNet, which enables more precise joint-level control of human motion than previous text-to-motion generation methods. Together, these contributions bring text-to-motion generation closer to real-world applications.
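To make the linear-complexity claim concrete, the following is a minimal NumPy sketch of generic (non-causal) kernelized linear attention over a sequence of N motion frames. It is not the authors' Efficient Motion Transformer and not a flash (hardware-aware, chunked) implementation. The `elu(x) + 1` feature map, the function names, and the tensor shapes are all assumptions chosen for illustration. The sketch only shows that reordering the matrix products avoids building the N x N attention matrix while giving the same output as the quadratic form.

```python
import numpy as np


def feature_map(x):
    # Assumed positive feature map: elu(x) + 1 (an illustrative choice).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v):
    """Linear attention: cost O(N * d * d_v), no N x N matrix is formed.

    q, k: (N, d) queries and keys for N frames; v: (N, d_v) values.
    """
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v              # (d, d_v) summary of all keys and values
    z = k.sum(axis=0)         # (d,) normalizer accumulated over keys
    return (q @ kv) / (q @ z)[:, None]


def quadratic_reference(q, k, v):
    """Same kernel attention computed the O(N^2) way, for comparison."""
    q, k = feature_map(q), feature_map(k)
    scores = q @ k.T          # (N, N) pairwise similarities
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_frames, d, d_v = 196, 8, 8   # hypothetical sizes
    q = rng.standard_normal((n_frames, d))
    k = rng.standard_normal((n_frames, d))
    v = rng.standard_normal((n_frames, d_v))
    out = linear_attention(q, k, v)
    print(out.shape, np.allclose(out, quadratic_reference(q, k, v)))
```

Because `kv` and `z` have sizes independent of N, doubling the number of frames roughly doubles the work in `linear_attention`, whereas `quadratic_reference` scales with the square of the sequence length.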
