MambaTalk: Efficient Holistic Gesture Synthesis with Selective State Space Models
Zunnan Xu, Yukang Lin, Haonan Han, Sicheng Yang, Ronghui Li, Yachao Zhang, Xiu Li
TL;DR
The paper addresses low-latency co-speech gesture synthesis for long sequences by introducing MambaTalk, a two-stage framework that first learns discrete motion priors with a VQ-VAE and then trains speech-conditioned selective state space models with global and local scan modules. By combining Mamba-based selective scanning with part-specific decoders, the approach produces diverse, rhythmically aligned gestures and improves facial motion fidelity, matching or exceeding state-of-the-art holistic gesture methods on BEAT2. Extensive ablations demonstrate the importance of discrete priors, scanning strategies, and audio encoders for cross-modal synthesis. The work targets interactive HCI applications such as film, robotics, and virtual environments, where high-quality, low-latency full-body gesture generation is needed, and the project is publicly released.
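To make the "discrete motion priors" idea concrete, here is a minimal sketch of the vector-quantization step at the core of a VQ-VAE: each continuous latent vector is replaced by its nearest codebook entry. This is a generic illustration, not the authors' implementation; the function name `quantize`, the toy codebook, and the 2-dimensional latents are all assumptions made for the example.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (squared Euclidean distance).

    latents:  (T, D) array of continuous latent vectors, one per time step.
    codebook: (K, D) array of learned code vectors (hypothetical toy values here).
    Returns the chosen indices (T,) and the quantized vectors (T, D).
    """
    # Pairwise squared distances between every latent and every code: shape (T, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

# Toy example: three codes, three latent frames.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]])
indices, quantized = quantize(latents, codebook)
print(indices)  # [0 1 2]
```

In a two-stage setup of this kind, the second-stage model would predict code indices (or their embeddings) from speech, and a decoder would map the selected codes back to motion.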
Abstract
Gesture synthesis is a vital area of human-computer interaction, with applications in fields such as film, robotics, and virtual reality. Recent work has used diffusion models and attention mechanisms to improve gesture synthesis. However, because of the high computational complexity of these techniques, generating long and diverse sequences with low latency remains a challenge. We explore the potential of state space models (SSMs) to address this challenge, implementing a two-stage modeling strategy with discrete motion priors to enhance gesture quality. Building on the Mamba block, we introduce MambaTalk, which enhances gesture diversity and rhythm through multimodal integration. Extensive experiments demonstrate that our method matches or exceeds the performance of state-of-the-art models. Our project is publicly available at https://kkakkkka.github.io/MambaTalk
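The efficiency argument rests on the fact that a state space model processes a sequence with a recurrence whose cost grows linearly with length. The sketch below shows a simplified selective scan in the spirit of Mamba, where the step size and the input/output projections depend on the current input. It is a plain NumPy reference loop under stated assumptions, not the paper's code or the optimized Mamba kernel: the function name `selective_scan`, the projection matrices `W_B`, `W_C`, `W_dt`, and all sizes are hypothetical.

```python
import numpy as np

def selective_scan(x, A, W_B, W_C, W_dt):
    """Simplified input-dependent (selective) SSM recurrence.

    x:    (T, D) input sequence.
    A:    (D, N) negative entries, a diagonal state matrix per channel.
    W_B:  (D, N) projection giving the input-dependent B_t.
    W_C:  (D, N) projection giving the input-dependent C_t.
    W_dt: (D, D) projection giving the input-dependent step size.
    Returns y of shape (T, D). Cost is linear in T.
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))              # hidden state, one N-dim state per channel
    y = np.empty((T, D))
    for t in range(T):
        dt = np.log1p(np.exp(x[t] @ W_dt))   # softplus keeps step sizes positive, (D,)
        B = x[t] @ W_B                        # (N,)
        C = x[t] @ W_C                        # (N,)
        A_bar = np.exp(dt[:, None] * A)       # discretized state transition, (D, N)
        B_bar = dt[:, None] * B[None, :]      # simplified discretized input matrix, (D, N)
        h = A_bar * h + B_bar * x[t][:, None] # state update
        y[t] = h @ C                          # readout, (D,)
    return y

rng = np.random.default_rng(0)
T, D, N = 16, 4, 8
x = rng.standard_normal((T, D))
A = -np.exp(rng.standard_normal((D, N)))     # negative values give a decaying state
W_B = rng.standard_normal((D, N)) * 0.1
W_C = rng.standard_normal((D, N)) * 0.1
W_dt = rng.standard_normal((D, D)) * 0.1
y = selective_scan(x, A, W_B, W_C, W_dt)
print(y.shape)  # (16, 4)
```

Because each step only touches the fixed-size state `h`, the work per frame is constant, in contrast to self-attention, whose cost grows quadratically with sequence length.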
