JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model
Farzaneh Jafari, Stefano Berretti, Anup Basu
TL;DR
JambaTalk tackles the challenge of long-context, speech-driven 3D talking head generation by marrying a Mamba-based selective state-space model with Transformer blocks, enhanced by Mixture-of-Experts routing and efficient attention techniques. The approach introduces Low-Rank Learned Rotary Positional Embedding and Grouped Query Attention to balance memory efficiency with modeling power, enabling context windows up to 256K tokens on standard GPUs. Extensive experiments on VOCASET and BIWI_6 show improvements in lip synchronization, upper-face dynamics, and motion realism, supported by ablations that confirm the contribution of each component. Real-time inference with streaming audio demonstrates practical viability for interactive applications, with user studies further validating perceptual gains over state-of-the-art baselines.
Abstract
In recent years, talking head generation has become a focal point for researchers. Considerable effort has gone into refining lip-sync motion, capturing expressive facial expressions, generating natural head poses, and achieving high-quality video. However, no single model has yet achieved equivalent performance across all quantitative and qualitative metrics. We introduce Jamba, a hybrid Transformer-Mamba model, to animate a 3D face. Mamba, a pioneering Structured State Space Model (SSM) architecture, was developed to overcome the limitations of conventional Transformer architectures, particularly in handling long sequences, a challenge that has constrained traditional models. Jamba combines the advantages of both the Transformer and Mamba approaches, offering a comprehensive solution. Building on the foundational Jamba block, we present JambaTalk to enhance motion variety and lip sync through multimodal integration. Extensive experiments show that our method achieves performance comparable or superior to state-of-the-art models.
