DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework

Fan Zhang, Naye Ji, Fuxing Gao, Bozuo Zhao, Jingmei Wu, Yanbing Jiang, Hui Du, Zhenqing Ye, Jiayang Zhu, WeiFan Zhong, Leyao Yan, Xiaomeng Ma

TL;DR

DiM-Gesture introduces a memory-efficient, diffusion-based framework for co-speech gesture generation that derives highly personalized 3D full-body gestures directly from raw speech. It replaces Transformer components with Mamba-2-based AdaLN conditioning and a dual-component fuzzy feature extractor that unifies local audio latent features with global style information to condition a diffusion model. In experiments on ZEGGS and BEAT, DiM-Gesture achieves gesture quality and synchronization competitive with Persona-Gestor while reducing memory footprint and speeding up inference. The approach eliminates the need for explicit style labels, enabling end-to-end generation of speaker-aware gestures with improved efficiency and practical potential for real-time virtual humans.

Abstract

Speech-driven gesture generation is an emerging domain within virtual human creation, where current methods predominantly utilize Transformer-based architectures that necessitate extensive memory and are characterized by slow inference speeds. In response to these limitations, we propose DiM-Gesture, a novel end-to-end generative model crafted to create highly personalized 3D full-body gestures solely from raw speech audio, employing Mamba-based architectures. This model integrates a Mamba-based fuzzy feature extractor with a non-autoregressive Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture. The extractor, leveraging a Mamba framework and a WavLM pre-trained model, autonomously derives implicit, continuous fuzzy features, which are then unified into a single latent feature. This feature is processed by the AdaLN Mamba-2, which implements a uniform conditional mechanism across all tokens to robustly model the interplay between the fuzzy features and the resultant gesture sequence. This innovative approach guarantees high fidelity in gesture-speech synchronization while maintaining the naturalness of the gestures. Employing a diffusion model for training and inference, our framework has undergone extensive subjective and objective evaluations on the ZEGGS and BEAT datasets. These assessments substantiate our model's enhanced performance relative to contemporary state-of-the-art methods, demonstrating competitive outcomes with the DiT-based architecture (Persona-Gestor) while optimizing memory usage and accelerating inference speed.
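
The AdaLN conditioning mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the common DiT-style formulation in which a condition vector is projected to a shift, a scale, and a gate that modulate every token identically (one reading of "uniform conditional mechanism across all tokens"). The names `adaln_block`, `identity_mixer`, `linear`, and `layer_norm` are hypothetical, the dimensions are arbitrary, and the Mamba-2 sequence mixer is replaced by a pass-through placeholder.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """Dense layer; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def identity_mixer(sequence):
    """Placeholder for the Mamba-2 sequence mixer (assumption: any seq-to-seq map fits here)."""
    return sequence


def adaln_block(tokens, cond, weight, bias, mixer):
    """One AdaLN-modulated residual block.

    tokens: list of T token vectors of size D (e.g. noisy gesture frames)
    cond:   one condition vector of size C, shared by all tokens (assumed)
    weight: projection of shape (3 * D, C); bias: length 3 * D
    """
    d = len(tokens[0])
    params = linear(cond, weight, bias)
    shift, scale, gate = params[:d], params[d:2 * d], params[2 * d:]
    # The same shift and scale are applied to every normalized token.
    modulated = [
        [n * (1.0 + s) + b for n, s, b in zip(layer_norm(tok), scale, shift)]
        for tok in tokens
    ]
    mixed = mixer(modulated)
    # Gated residual connection back onto the unmodified input tokens.
    return [
        [t + g * m for t, g, m in zip(tok, gate, mix)]
        for tok, mix in zip(tokens, mixed)
    ]


random.seed(0)
T, D, C = 3, 4, 2  # arbitrary illustrative sizes
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
cond = [random.gauss(0, 1) for _ in range(C)]
weight = [[random.gauss(0, 0.1) for _ in range(C)] for _ in range(3 * D)]
bias = [0.0] * (3 * D)
out = adaln_block(tokens, cond, weight, bias, identity_mixer)
print(len(out), len(out[0]))
```

With an all-zero projection the gate is zero and the block returns its input unchanged, which mirrors the adaLN-Zero initialization used in DiT; whether DiM-Gesture initializes its blocks this way is not stated in the text above.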

Paper Structure (24 sections, 4 equations, 3 figures, 2 tables, 2 algorithms)

This paper contains 24 sections, 4 equations, 3 figures, 2 tables, 2 algorithms.

Figures (3)

  • Figure 1: The architecture of DiM-Gesture combines a Mamba-based fuzzy feature extractor with an Adaptive Layer Normalization (AdaLN) Mamba-2 diffusion architecture. The fuzzy feature extractor is a dual-component system that captures both style and detailed audio features. The resulting unified latent features are fed into the AdaLN Mamba-2 module, which models the relationship between the incoming audio features and the corresponding gestures and estimates the diffusion noise within the diffusion model, enabling the generation of diverse gestures. The figure has three panels: (a) Overall Schematic, (b) Mamba-based Style Fuzzy Feature Extractor, and (c) AdaLN Mamba-2 Block.
  • Figure 2: An overview of the Mamba-based style fuzzy inference condition extractor.
  • Figure 3: Sample gestures corresponding to various emotions. The left side shows the ground-truth gestures, the middle shows gestures generated by our architecture (DiM-Gesture), and the right side shows gestures synthesized by Persona-Gestor.