MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models

Zunnan Xu, Yukang Lin, Haonan Han, Sicheng Yang, Ronghui Li, Yachao Zhang, Xiu Li

TL;DR

The paper tackles efficient, low-latency co-speech gesture synthesis for long sequences by introducing MambaTalk, a two-stage framework that first learns discrete motion priors via VQ-VAEs and then trains a speech-conditioned selective state space model with global and local scan modules. By integrating Mamba-based selective scanning with part-specific decoders, the approach produces diverse, rhythmically aligned gestures and improves facial motion fidelity, matching or exceeding state-of-the-art holistic gesture methods on BEAT2. Ablations examine the contribution of discrete priors, scanning strategies, and audio encoders. The work targets interactive HCI applications such as film, robotics, and virtual environments, and the project is publicly available.

Abstract

Gesture synthesis is a vital realm of human-computer interaction, with wide-ranging applications across various fields like film, robotics, and virtual reality. Recent advancements have utilized the diffusion model and attention mechanisms to improve gesture synthesis. However, due to the high computational complexity of these techniques, generating long and diverse sequences with low latency remains a challenge. We explore the potential of state space models (SSMs) to address the challenge, implementing a two-stage modeling strategy with discrete motion priors to enhance the quality of gestures. Leveraging the foundational Mamba block, we introduce MambaTalk, enhancing gesture diversity and rhythm through multimodal integration. Extensive experiments demonstrate that our method matches or exceeds the performance of state-of-the-art models. Our project is publicly available at https://kkakkkka.github.io/MambaTalk
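
To make the idea of a selective state space model concrete, the following is a minimal NumPy sketch of a Mamba-style selective scan, in which the step size and the input/output projections depend on the current input. It is an illustration under stated assumptions, not the authors' implementation: the function name `selective_scan`, the weight matrices `W_delta`, `W_B`, `W_C`, and all shapes are hypothetical, and the sequential loop stands in for the hardware-aware parallel scan used in practice.

```python
import numpy as np

def selective_scan(x, A, W_delta, W_B, W_C):
    """Sequential reference scan for a diagonal, input-dependent SSM.

    x:        (T, D) input sequence (e.g. fused speech features; assumed)
    A:        (D, N) state matrix entries, negative for stability
    W_delta:  (D, D) hypothetical projection producing the step size
    W_B, W_C: (D, N) hypothetical projections producing B_t and C_t
    Returns y of shape (T, D).
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))            # hidden state, one N-dim state per channel
    y = np.empty((T, D))
    for t in range(T):
        # Input-dependent parameters: this is the "selective" part.
        delta = np.log1p(np.exp(x[t] @ W_delta))   # softplus, shape (D,)
        B = x[t] @ W_B                              # shape (N,)
        C = x[t] @ W_C                              # shape (N,)
        # Discretize: exponential for A, first-order approximation for B.
        A_bar = np.exp(delta[:, None] * A)          # (D, N)
        B_bar = delta[:, None] * B[None, :]         # (D, N)
        # Linear recurrence followed by readout.
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C
    return y
```

The cost of this recurrence grows linearly with the sequence length `T`, which is the property that makes SSMs attractive for long sequences compared with attention, whose cost grows quadratically.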

Paper Structure

This paper contains 24 sections, 17 equations, 6 figures, 6 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Our two-stage method for co-speech gesture generation with selective state space models. In the first stage, we construct discrete motion spaces to learn specific motion codes. In the second stage, we develop a speech-driven model of the latent space using selective scanning mechanisms.
  • Figure 2: We propose a two-stage method for co-speech gesture generation. We first train multiple VQ-VAEs for face and different parts of body reconstruction. This step learns discrete motion priors through multiple codebooks. In the second stage, we train a speech-driven gesture generation model in the latent motion space with local and global scan modules.
  • Figure 3: Visualization of the gestures generated by CaMN, EMAGE and our method. Unreasonable results are indicated by red boxes and reasonable ones by green boxes.
  • Figure 4: Visualization of the facial motions generated by CaMN, EMAGE and our method. Unreasonable results are indicated by red and gray boxes and reasonable ones by green boxes.
  • Figure 5: Visualization of the gestures generated by CaMN, EMAGE and our method. Unreasonable results are indicated by red boxes and reasonable ones by green boxes.
  • ...and 1 more figure
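
The first stage described in the captions of Figures 1 and 2 learns discrete motion codes through multiple codebooks. The core quantization step of a VQ-VAE can be sketched as a nearest-neighbour lookup, shown below in NumPy. This is a hedged illustration only: the function name `quantize`, the part names, the codebook sizes, and the latent dimension are assumptions chosen for the example and are not taken from the paper.

```python
import numpy as np

def quantize(z, codebook):
    """Map each latent vector to its nearest codebook entry.

    z:        (T, d) encoder outputs for T frames
    codebook: (K, d) learned code vectors
    Returns (indices of shape (T,), quantized latents of shape (T, d)).
    """
    # Squared Euclidean distance between every frame and every code: (T, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

# One codebook per body part (part names and sizes are hypothetical).
rng = np.random.default_rng(0)
codebooks = {part: rng.normal(size=(256, 32))
             for part in ("face", "upper_body", "hands", "lower_body")}

# Stand-in encoder outputs for 64 frames per part.
latents = {part: rng.normal(size=(64, 32)) for part in codebooks}

# Discrete motion codes: one integer index per frame and part.
codes = {part: quantize(latents[part], codebooks[part])[0] for part in codebooks}
```

In a setup like this, the second-stage model would predict the integer codes in `codes` from speech, and the part-specific decoders would map the corresponding code vectors back to motion.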