MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation
Xiaofeng Mao, Zhengkai Jiang, Qilin Wang, Chencan Fu, Jiangning Zhang, Jiafu Wu, Yabiao Wang, Chengjie Wang, Wei Li, Mingmin Chi
TL;DR
MDT-A2G tackles co-speech gesture generation by fusing multi-modal cues and applying a Masked Diffusion Transformer directly to the gesture sequence, enabling efficient and coherent motion synthesis. The method introduces a masked modeling scheme, a simple-yet-effective multi-modal fusion module, and a scaling-aware accelerated sampling process, yielding training that is over 6x faster than traditional diffusion transformers and inference that is about 5.7x faster than a standard diffusion model. Experiments on BEAT show state-of-the-art gesture quality, diversity, and temporal alignment for both whole-body and upper-body motions; qualitative results confirm realistic and expressive gestures. The work advances co-speech gesture generation by combining mask-based learning with diffusion-based generation and multi-modal conditioning, with potential for responsive avatars and interactive systems.
Abstract
Recent advances in Diffusion Transformers have substantially improved the generation of high-quality 2D images, 3D videos, and 3D shapes. However, the effectiveness of the Transformer architecture for co-speech gesture generation remains relatively unexplored, as prior methods have predominantly employed Convolutional Neural Networks (CNNs) or only a few simple transformer layers. To bridge this research gap, we introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G, which performs the denoising process directly on gesture sequences. To enhance contextual reasoning over temporally aligned, speech-driven gestures, the model employs a mask modeling scheme specifically designed to strengthen temporal relation learning among the gestures in a sequence, thereby speeding up learning and leading to coherent and realistic motions. Apart from audio, our MDT-A2G model also integrates multi-modal information, encompassing text, emotion, and identity. Furthermore, we propose an efficient inference strategy that reduces the denoising computation by leveraging previously calculated results, thereby achieving a speedup with negligible performance degradation. Experimental results demonstrate that MDT-A2G excels in gesture generation, with a learning speed that is over 6$\times$ faster than traditional diffusion transformers and an inference speed that is 5.7$\times$ faster than the standard diffusion model.
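
To make the idea of mask modeling along the time axis concrete, the following is a minimal sketch of randomly hiding a fraction of frames in a gesture sequence so that a model would have to infer them from the remaining temporal context. It is an illustration only, not the authors' implementation: the abstract does not state the mask ratio, the mask layout, or how masked frames are handled, so the function name `mask_gesture_frames`, the `mask_ratio` value, and the uniform random sampling are all assumptions.

```python
import random


def mask_gesture_frames(frames, mask_ratio=0.3, seed=None):
    """Randomly hide a fraction of frames along the time axis.

    frames: list of per-frame pose vectors (each a list of floats).
    mask_ratio: hypothetical fraction of frames to hide (not from the paper).
    Returns (visible_frames, visible_idx, masked_idx), indices in time order.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    num_frames = len(frames)
    num_masked = round(num_frames * mask_ratio)
    # Sample the time steps to hide uniformly at random (an assumption).
    masked_idx = sorted(rng.sample(range(num_frames), num_masked))
    masked_set = set(masked_idx)
    visible_idx = [t for t in range(num_frames) if t not in masked_set]
    visible_frames = [frames[t] for t in visible_idx]
    return visible_frames, visible_idx, masked_idx


# Toy sequence: 10 frames with 3 pose values each.
toy_sequence = [[float(t), 0.0, 0.0] for t in range(10)]
visible, visible_idx, masked_idx = mask_gesture_frames(
    toy_sequence, mask_ratio=0.3, seed=0
)
print("visible time steps:", visible_idx)
print("masked time steps:", masked_idx)
```

In a training loop, only the visible frames (together with their time indices) would be passed to the network, and the masked indices would mark the positions it is asked to reconstruct; how MDT-A2G actually does this is described in the paper's method section, not here.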
