JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model

Farzaneh Jafari, Stefano Berretti, Anup Basu

TL;DR

JambaTalk tackles the challenge of long-context, speech-driven 3D talking head generation by marrying a Mamba-based selective state-space model with Transformer blocks, enhanced by Mixture-of-Experts routing and efficient attention techniques. The approach introduces Low-Rank Learned Rotary Positional Embedding and Grouped Query Attention to balance memory efficiency with modeling power, enabling context windows up to 256K tokens on standard GPUs. Extensive experiments on VOCASET and BIWI_6 show improvements in lip synchronization, upper-face dynamics, and motion realism, supported by ablations that confirm the contribution of each component. Real-time inference with streaming audio demonstrates practical viability for interactive applications, with user studies further validating perceptual gains over state-of-the-art baselines.
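
To make the memory argument behind Grouped Query Attention concrete, the sketch below estimates the key/value cache size at the 256K-token context mentioned above. The formula is the usual per-token cache accounting for attention layers; the layer count, head counts, head dimension, and 2-byte precision are illustrative assumptions, not values reported for JambaTalk. Mamba layers are left out because they carry a fixed-size recurrent state rather than a cache that grows with sequence length.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one cached tensor for keys and one for values.
    return 2 * seq_len * n_attn_layers * n_kv_heads * head_dim * bytes_per_value


SEQ_LEN = 256 * 1024   # the 256K-token context from the summary
N_ATTN_LAYERS = 4      # hypothetical number of attention layers
N_QUERY_HEADS = 32     # hypothetical; standard multi-head attention caches one K/V per query head
N_KV_HEADS = 8         # hypothetical; GQA shares each K/V head across a group of query heads
HEAD_DIM = 128         # hypothetical

mha = kv_cache_bytes(SEQ_LEN, N_ATTN_LAYERS, N_QUERY_HEADS, HEAD_DIM)
gqa = kv_cache_bytes(SEQ_LEN, N_ATTN_LAYERS, N_KV_HEADS, HEAD_DIM)

print(f"multi-head: {mha / 2**30:.1f} GiB")
print(f"grouped-query: {gqa / 2**30:.1f} GiB")
print(f"reduction: {mha // gqa}x")
```

Under these assumed sizes the cache shrinks by the ratio of query heads to key/value heads, which is the saving GQA trades against a small loss in per-head expressiveness.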

Abstract

In recent years, talking head generation has become a focal point for researchers. Considerable effort is being made to refine lip-sync motion, capture expressive facial expressions, generate natural head poses, and achieve high-quality video. However, no single model has yet achieved equivalence across all quantitative and qualitative metrics. We introduce Jamba, a hybrid Transformer-Mamba model, to animate a 3D face. Mamba, a pioneering Structured State Space Model (SSM) architecture, was developed to overcome the limitations of conventional Transformer architectures, particularly the difficulty of handling long sequences that has constrained traditional models. Jamba combines the advantages of both the Transformer and Mamba approaches, offering a comprehensive solution. Based on the foundational Jamba block, we present JambaTalk to enhance motion variety and lip sync through multimodal integration. Extensive experiments reveal that our method achieves performance comparable or superior to state-of-the-art models.

Paper Structure

This paper contains 29 sections, 21 equations, 8 figures, 8 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Overview of JambaTalk: The Wav2Vec 2.0 model is used to extract features from the input speech, with the encoder initialized using pre-trained weights from the original model [Baevski2020wav2vec]. These encoded features are passed to the JambaTalk decoder, which generates a sequence of animated 3D face meshes. The Transformer layer incorporates Low-Rank Learned Rotary Positional Embedding (LRL-RoPE) and Grouped-Query Attention (GQA), providing a computation-efficient alternative to traditional Transformers. The lip feature extraction block then converts the motion decoder outputs into lip deformation features by selecting lip vertices with a lip mask. These features are processed by a Transformer-based lip encoder in the lip reader module to synchronize lip shapes.
  • Figure 2: Details of the Mamba and MoE_Mamba layers in the JambaTalk Decoder. Both layers begin with an RMSNorm normalization followed by a Mamba block for sequence modeling and include residual connections to preserve gradient flow. In the standard Mamba Layer (left), the Mamba output is followed by another RMSNorm and a feedforward MLP block. In contrast, the MoE_Mamba Layer (right) replaces the MLP with a Mixture-of-Experts (MoE) module, enabling dynamic expert routing per token and enhancing model capacity while maintaining computational efficiency [Lieber2024Jamba].
  • Figure 3: A visual comparison of frames from synthesized facial animation sequences produced by various methods, alongside reference frames from the ground-truth sequence. The utterances marked in red correspond to the frames shown. Our approach generates lip shapes that closely resemble the reference frames. Left: BIWI$_6$ Test-B. Right: VOCASET Test.
  • Figure 4: The temporal statistics (mean and standard deviation) of motion variations between adjacent frames in the sequence on the VOCASET Test and BIWI$_6$ Test-B datasets.
  • Figure 5: Lip opening distance over time on the BIWI$_6$ Test-B dataset, measured as the 3D Euclidean distance between the upper and lower lip landmarks in each video frame. Peaks indicate moments when the mouth is open wider, while valleys correspond to smaller openings or closed lips.
  • ...and 3 more figures
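
The layer layout described in the Figure 2 caption can be sketched in a few lines. This is a structural illustration only: `mamba_stub` and `mlp_stub` are hypothetical placeholders rather than real Mamba or MLP blocks, the router weights are made up, and placing a residual connection around each sub-block is an assumption, since the caption does not say exactly where the residuals sit.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: divide by the root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def mamba_stub(seq):
    """Placeholder sequence mixer (a causal running mean), NOT a real Mamba block."""
    out, running = [], [0.0] * len(seq[0])
    for t, x in enumerate(seq, start=1):
        running = [r + v for r, v in zip(running, x)]
        out.append([r / t for r in running])
    return out


def mlp_stub(x):
    """Placeholder feed-forward block (elementwise tanh)."""
    return [math.tanh(v) for v in x]


def moe_stub(x, experts, router_weights, top_k=1):
    """Route one token to its top-k experts and mix outputs by softmax gate."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in chosen)
    gates = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(gates)
    out = [0.0] * len(x)
    for gate, i in zip(gates, chosen):
        out = [o + (gate / total) * y for o, y in zip(out, experts[i](x))]
    return out


def layer(seq, channel_mixer):
    """RMSNorm -> Mamba -> residual, then RMSNorm -> MLP or MoE -> residual."""
    mixed = mamba_stub([rms_norm(x) for x in seq])
    seq = [[a + b for a, b in zip(x, m)] for x, m in zip(seq, mixed)]
    return [[a + b for a, b in zip(x, channel_mixer(rms_norm(x)))] for x in seq]


# Toy input: 3 time steps, 2 features each.
seq = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]]
experts = [mlp_stub, lambda x: [2.0 * v for v in x]]   # two hypothetical experts
router = [[1.0, 0.0], [0.0, 1.0]]                      # hypothetical router weights

dense_out = layer(seq, mlp_stub)                                         # Mamba layer
moe_out = layer(seq, lambda x: moe_stub(x, experts, router, top_k=1))    # MoE_Mamba layer
```

The only difference between the two calls is the channel mixer passed to `layer`, which mirrors the caption's point that the MoE variant swaps the MLP for per-token expert routing and leaves the rest of the layer unchanged.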
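
The quantities plotted in Figures 4 and 5 can be illustrated on a toy mesh sequence. The sketch below assumes each frame is a list of 3D vertices, that one upper-lip and one lower-lip landmark are picked by index (the indices here are hypothetical), and that "motion variation" means the mean per-vertex displacement between adjacent frames. The paper may define these measurements differently.

```python
import math
from statistics import mean, pstdev


def euclidean(p, q):
    """3D Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def lip_opening(frames, upper_idx, lower_idx):
    """Per-frame distance between an upper-lip and a lower-lip landmark."""
    return [euclidean(f[upper_idx], f[lower_idx]) for f in frames]


def motion_variation(frames):
    """Mean per-vertex displacement between each pair of adjacent frames."""
    return [
        mean([euclidean(p, q) for p, q in zip(prev, curr)])
        for prev, curr in zip(frames, frames[1:])
    ]


# Toy sequence: 3 frames, 2 vertices each (vertex 0 = upper lip, vertex 1 = lower lip).
frames = [
    [(0.0, 0.0, 0.0), (0.0, -1.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, -3.0, 0.0)],
    [(0.0, 1.0, 0.0), (0.0, -3.0, 0.0)],
]

openings = lip_opening(frames, upper_idx=0, lower_idx=1)
variation = motion_variation(frames)

print("lip opening per frame:", openings)
print("motion variation mean/std:", mean(variation), pstdev(variation))
```

Plotting `openings` against the frame index would give a curve of the kind described for Figure 5, and the mean and standard deviation of `variation` are the kind of temporal statistics summarized in Figure 4.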