MambaGesture: Enhancing Co-Speech Gesture Generation with Mamba and Disentangled Multi-Modality Fusion

Chencan Fu; Yabiao Wang; Jiangning Zhang; Zhengkai Jiang; Xiaofeng Mao; Jiafu Wu; Weijian Cao; Chengjie Wang; Yanhao Ge; Yong Liu

TL;DR

This work presents MambaGesture, a novel framework integrating a Mamba-based attention block, MambaAttn, with a multi-modality feature fusion module, SEAD, achieving state-of-the-art performance in co-speech gesture generation.

Abstract

Co-speech gesture generation is crucial for producing synchronized and realistic human gestures that accompany speech, enhancing the animation of lifelike avatars in virtual environments. While diffusion models have shown impressive capabilities, current approaches often overlook a wide range of modalities and their interactions, resulting in less dynamic and contextually varied gestures. To address these challenges, we present MambaGesture, a novel framework integrating a Mamba-based attention block, MambaAttn, with a multi-modality feature fusion module, SEAD. The MambaAttn block combines the sequential data processing strengths of the Mamba model with the contextual richness of attention mechanisms, enhancing the temporal coherence of generated gestures. SEAD adeptly fuses audio, text, style, and emotion modalities, employing disentanglement to deepen the fusion process and yield gestures with greater realism and diversity. Our approach, rigorously evaluated on the multi-modal BEAT dataset, demonstrates significant improvements in Fréchet Gesture Distance (FGD), diversity scores, and beat alignment, achieving state-of-the-art performance in co-speech gesture generation. Project website: https://fcchit.github.io/mambagesture/

Paper Structure (17 sections, 12 equations, 3 figures, 5 tables)

Introduction
Related Work
Co-speech Gesture Generation
Diffusion-based Gesture Generation
State Space Models
Preliminary
Human Gesture Data Format
Denoising Diffusion Probabilistic Model
Method
Disentangled Multi-Modal Fusion
MambaAttn Denoiser
Experiments
Experiment Settings
Quantitative Results
Qualitative Results
...and 2 more sections

Figures (3)

Figure 1: Comparison of our approach with mainstream co-speech generation methods. (a) Autoencoder (AE)-based methods (liu2022beat; liu2023emage) synthesize gestures by fusing multi-modal data but inherently suffer from limited diversity due to architectural constraints. (b) Diffusion-based methods (yang2023diffusestylegesture; yang2023diffusestylegesture_plus) employ diffusion models with Transformers to generate diverse gestures but are hindered by the quadratic complexity of the Transformer and often overlook intricate multi-modal correlations. (c) Our MambaGesture leverages the linear scaling and sequential data processing advantages of the State Space Model to enhance gesture diversity and effectively harness multi-modal data with disentangled feature fusion, ensuring a broader spectrum and higher realism in gesture generation.
Figure 2: Overview of our proposed MambaGesture. We introduce a novel feature fusion strategy: the cross-attention enhanced Style and Emotion Aware Disentangled (SEAD) feature fusion module. This module employs style $\bm{s}$, audio $\bm{a}$, emotion $\bm{e}$, and text $\bm{text}$ as conditions to provide comprehensive information and effectively disentangle style and emotion from the audio. The feature $\bm{f^{\prime}_{se}}$ is obtained by concatenating $\bm{f^{\prime}_s}$ and $\bm{f^{\prime}_e}$ and projecting the result back to the original dimension with a linear layer. We also present a Mamba-based component, the MambaAttn block, which combines Mamba's sequence modeling strength with an attention mechanism that learns global information. Our denoising architecture, the MambaAttn denoiser, is composed of a stack of MambaAttn blocks and a linear layer. During the sampling phase, we predict the gesture $\bm{\hat{x}_0}$ by applying the fused conditions within a cyclical denoising and diffusion procedure.
Figure 3: Visualization results comparing state-of-the-art methods. Speech transcript: "... when you have to work Monday through Friday the whole week, you are very tired ..."
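
The concatenate-and-project step mentioned in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes NumPy, per-frame features of shape (T, d), and randomly initialized placeholder weights instead of trained parameters. The names `fuse_style_emotion`, `weight`, and `bias` are hypothetical and chosen only for illustration.

```python
import numpy as np


def fuse_style_emotion(f_s, f_e, weight, bias):
    """Concatenate style and emotion features, then project back to dimension d.

    f_s, f_e: arrays of shape (T, d), standing in for f'_s and f'_e
    weight:   array of shape (2 * d, d), a hypothetical linear-layer weight
    bias:     array of shape (d,), a hypothetical linear-layer bias
    """
    f_cat = np.concatenate([f_s, f_e], axis=-1)  # shape (T, 2d)
    return f_cat @ weight + bias                 # shape (T, d)


# Placeholder inputs: T frames, each with a d-dimensional feature (assumed sizes).
rng = np.random.default_rng(0)
T, d = 4, 8
f_s = rng.standard_normal((T, d))
f_e = rng.standard_normal((T, d))
weight = rng.standard_normal((2 * d, d)) * 0.1  # untrained, illustration only
bias = np.zeros(d)

f_se = fuse_style_emotion(f_s, f_e, weight, bias)
print(f_se.shape)  # (4, 8)
```

Projecting back to d keeps the fused feature the same width as each input, so a block that accepts d-dimensional features could consume it without changes. How the paper's denoiser actually consumes this feature is not specified in this excerpt.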
