MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation

Xiaofeng Mao, Zhengkai Jiang, Qilin Wang, Chencan Fu, Jiangning Zhang, Jiafu Wu, Yabiao Wang, Chengjie Wang, Wei Li, Mingmin Chi

TL;DR

MDT-A2G tackles co-speech gesture generation by fusing multi-modal cues and applying a Masked Diffusion Transformer to the gesture sequence, enabling efficient and coherent motion synthesis. The method introduces a masked modeling scheme, a simple-yet-effective multi-modal fusion module, and a scaling-aware accelerated sampling process, yielding faster training (over 6x) and faster inference (about 5.7x) than standard diffusion models. Experiments on BEAT show state-of-the-art gesture quality, diversity, and temporal alignment for both whole-body and upper-body motions; qualitative results confirm realistic and expressive gestures. The work advances co-speech gesture generation by combining mask-based learning with diffusion-based generation and multi-modal conditioning, with potential for responsive avatars and interactive systems.

Abstract

Recent advances in Diffusion Transformers have substantially improved the generation of high-quality 2D images, 3D videos, and 3D shapes. However, the effectiveness of the Transformer architecture for co-speech gesture generation remains relatively unexplored, as prior methods have predominantly employed Convolutional Neural Networks (CNNs) or only a few simple Transformer layers. To bridge this gap, we introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G, which performs the denoising process directly on gesture sequences. To enhance contextual reasoning over temporally aligned speech-driven gestures, the model employs a mask modeling scheme specifically designed to strengthen temporal relation learning among gesture sequences, thereby speeding up learning and yielding coherent and realistic motions. Apart from audio, MDT-A2G also integrates multi-modal information, encompassing text, emotion, and identity. Furthermore, we propose an efficient inference strategy that reduces the denoising computation by reusing previously calculated results, achieving a speedup with negligible performance degradation. Experimental results demonstrate that MDT-A2G excels in gesture generation, with a learning speed over 6$\times$ faster than traditional diffusion transformers and an inference speed 5.7$\times$ faster than the standard diffusion model.
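
The mask modeling scheme mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: the function names, the uniform random choice of frames, the 50% mask ratio, and the zero-valued mask token are all assumptions made for illustration. The idea shown is only that a subset of frames in the noisy gesture sequence is hidden, the network sees the visible frames, and the hidden positions are later filled with a placeholder token so the full sequence can be predicted.

```python
import numpy as np

def mask_gesture_tokens(tokens, mask_ratio, rng):
    """Hide a random fraction of frames in a (frames, dim) gesture sequence.

    Returns the visible tokens, the sorted indices that were kept, and a
    boolean array over the frame axis (True = hidden).
    Assumption: frames are chosen uniformly at random.
    """
    num_frames = tokens.shape[0]
    num_keep = max(1, int(round(num_frames * (1.0 - mask_ratio))))
    keep_idx = np.sort(rng.permutation(num_frames)[:num_keep])
    hidden = np.ones(num_frames, dtype=bool)
    hidden[keep_idx] = False
    return tokens[keep_idx], keep_idx, hidden

def restore_with_mask_token(visible, keep_idx, num_frames, mask_token):
    """Rebuild a full-length sequence, placing mask_token at hidden frames."""
    full = np.tile(mask_token, (num_frames, 1))
    full[keep_idx] = visible
    return full

rng = np.random.default_rng(0)
noisy_gestures = rng.standard_normal((8, 4))   # 8 frames, 4 pose features (toy sizes)
visible, keep_idx, hidden = mask_gesture_tokens(noisy_gestures, 0.5, rng)
restored = restore_with_mask_token(visible, keep_idx, 8, np.zeros(4))
```

In a real model the mask token would typically be a learned vector and the reconstruction of hidden frames would be part of the training loss; both details are omitted here.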

Paper Structure (23 sections, 1 theorem, 8 equations, 4 figures, 7 tables)

Key Result

Theorem 1

Given $\hat{x}^t_0$ and the sampled $\hat{x}_{t}$ at time $t$, we can compute $\hat{x}^{t-1}_0$ for time $t-1$ without relying on neural networks. The following formula is employed to diminish the exposure bias: […] Here, $\mathit{scale}$ is a hyperparameter greater than 1 but close to 1.
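
To make the idea of skipping network evaluations concrete, here is a minimal sketch of an accelerated reverse process. It is not the formula of Theorem 1, which is not reproduced in this summary, and it does not model the $\mathit{scale}$ correction. It assumes the simplest possible stand-in: the previous estimate of the clean sample is reused unchanged for the skipped steps, and each step is a standard deterministic DDIM update. The names `ddim_step`, `sample_with_reuse`, and `reuse_every` are hypothetical.

```python
import numpy as np

def ddim_step(x_t, x0_pred, abar_t, abar_prev):
    """Deterministic DDIM update using a given estimate of the clean sample."""
    # Noise implied by the current sample and the clean-sample estimate.
    eps = (x_t - np.sqrt(abar_t) * x0_pred) / np.sqrt(1.0 - abar_t)
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps

def sample_with_reuse(denoiser, x_T, alpha_bars, reuse_every):
    """Reverse process that calls `denoiser` only every `reuse_every` steps.

    alpha_bars[t] is the cumulative signal level at step t (all values < 1).
    Assumption: between network calls the cached x0 estimate is held fixed.
    """
    x, x0_pred, calls = x_T, None, 0
    num_steps = len(alpha_bars)
    for i, t in enumerate(range(num_steps - 1, -1, -1)):
        if i % reuse_every == 0:
            x0_pred = denoiser(x, t)   # expensive network evaluation
            calls += 1
        abar_prev = alpha_bars[t - 1] if t > 0 else 1.0
        x = ddim_step(x, x0_pred, alpha_bars[t], abar_prev)
    return x, calls

# Toy check with an oracle "network" that always returns the clean target.
target = np.ones((6, 3))
alpha_bars = np.linspace(0.999, 0.01, 20)
x_T = np.random.default_rng(0).standard_normal((6, 3))
sample, calls = sample_with_reuse(lambda x, t: target, x_T, alpha_bars, reuse_every=4)
```

With 20 steps and `reuse_every=4` the network is evaluated 5 times instead of 20. Holding the estimate fixed is exactly the kind of approximation that can introduce exposure bias, which is the effect the paper's scaled formula is described as reducing.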

Figures (4)

  • Figure 1: Comparison between DSG+ (Yang et al., 2023) and our MDT-A2G-B with respect to training steps and training time on a single A100 GPU. Compared to DSG+, MDT-A2G-B converges faster during training and reaches better performance, demonstrating the effectiveness of the proposed method.
  • Figure 2: Overview of MDT-A2G. It primarily consists of three components: (1) Composite Multi-modal Feature Extractor, (2) Masked Diffusion Transformers, and (3) Scaling-aware Accelerated Sampling Process. For the multi-modal feature extractor, we propose an innovative feature fusion strategy that integrates time embeddings with emotion and ID features. These will be further concatenated with text, audio, and gesture features, resulting in a comprehensive feature representation. Additionally, we have designed a Masked Diffusion Transformer structure to expedite the convergence of the denoising network, thereby leading to more coherent motions. Finally, we introduce a scaling-aware accelerated sampling process by utilizing diffused results from previous timesteps, resulting in a faster sampling process.
  • Figure 3: Qualitative comparison of whole-body motion generation. Refer to the supplementary video for a more intuitive comparison.
  • Figure 4: Comparison across different acceleration ratios (AR). All results use MDT-A2G-B.
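
The feature fusion described in the Figure 2 caption can be sketched at the level of tensor shapes. The caption says only that time embeddings are integrated with emotion and ID features and that the result is concatenated with text, audio, and gesture features. The sketch below assumes that "integrated" means element-wise addition and that the combined vector is repeated for every frame; the function name and all dimensions are hypothetical.

```python
import numpy as np

def fuse_conditions(time_emb, emotion_emb, id_emb, text_feat, audio_feat, gesture_feat):
    """Build a per-frame feature matrix from global and frame-level inputs.

    time_emb, emotion_emb, id_emb: (dim,) global vectors.
    text_feat, audio_feat, gesture_feat: (frames, *) frame-aligned features.
    Assumption: the global vectors are combined by addition.
    """
    global_cond = time_emb + emotion_emb + id_emb
    num_frames = gesture_feat.shape[0]
    # Repeat the global condition for every frame before concatenating.
    per_frame = np.broadcast_to(global_cond, (num_frames, global_cond.shape[0]))
    return np.concatenate([per_frame, text_feat, audio_feat, gesture_feat], axis=-1)

rng = np.random.default_rng(0)
frames, dim = 10, 16
time_emb, emotion_emb, id_emb = (rng.standard_normal(dim) for _ in range(3))
text_feat = rng.standard_normal((frames, 8))
audio_feat = rng.standard_normal((frames, 12))
gesture_feat = rng.standard_normal((frames, 6))
fused = fuse_conditions(time_emb, emotion_emb, id_emb, text_feat, audio_feat, gesture_feat)
```

The fused matrix has one row per frame and 16 + 8 + 12 + 6 = 42 columns, and would then be projected to the Transformer's token width.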

Theorems & Definitions (1)

  • Theorem 1