HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

Shiyu Liu, Kui Jiang, Xianming Liu, Hongxun Yao, Xiaocheng Feng

TL;DR

HM-Talker tackles lip-sync and motion artifacts in audio-driven talking head synthesis by introducing a hybrid motion model that fuses explicit anatomical priors (Action Units) with implicit audio-driven cues. The Cross-Modal Disentanglement Module (CMDM) aligns audio and visual representations and enables AU supervision under audio-driven conditions, while the Hybrid Motion Modeling Module (HMMM) uses gated attention and stochastic feature pairing to achieve identity-agnostic generalization. Built atop a TalkingGaussian-inspired 3D Gaussian Splatting backbone, the approach achieves high visual fidelity and precise lip articulation, outperforming NeRF- and 3DGS-based baselines while maintaining real-time rendering. The results on multi-identity data show strong lip-sync accuracy and robust cross-subject generalization, indicating practical potential for personalized yet scalable talking head synthesis.

Abstract

Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily because they rely on implicit modeling of audio-facial motion correlations, an approach that lacks explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation that combines implicit and explicit motion cues. The explicit cues are Action Units (AUs), anatomically defined facial muscle movements, which are used alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit and explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit and explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker's superiority over state-of-the-art methods in visual quality and lip-sync accuracy.
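To make the idea of merging randomly paired implicit and explicit features more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that one explicit and one implicit feature vector are drawn independently at random and blended with a per-channel sigmoid gate. The names `gated_fuse`, `sample_pair`, and `gate_logits` are hypothetical; in the paper the gating comes from a learned attention mechanism, whereas here the gate logits are simply passed in.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fuse(explicit, implicit, gate_logits):
    """Blend two equal-length feature vectors channel by channel.

    Assumption: a gate near 1 favors the explicit (AU-based) feature,
    a gate near 0 favors the implicit feature.
    """
    if not (len(explicit) == len(implicit) == len(gate_logits)):
        raise ValueError("all vectors must have the same length")
    gates = [sigmoid(g) for g in gate_logits]
    return [g * e + (1.0 - g) * i for g, e, i in zip(gates, explicit, implicit)]


def sample_pair(explicit_pool, implicit_pool, rng):
    """Draw one explicit and one implicit feature independently at random."""
    return rng.choice(explicit_pool), rng.choice(implicit_pool)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical pools: two explicit and two implicit 3-channel features.
    explicit_pool = [[0.9, 0.1, 0.4], [0.8, 0.2, 0.5]]
    implicit_pool = [[0.2, 0.7, 0.3], [0.1, 0.6, 0.2]]
    e, i = sample_pair(explicit_pool, implicit_pool, rng)
    print(gated_fuse(e, i, gate_logits=[0.0, 2.0, -2.0]))
```

Because the pair is resampled at every step, a fusion module trained this way cannot rely on one fixed pairing of feature sources, which is one plausible reading of how random pairing discourages identity-specific shortcuts.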

Paper Structure

This paper contains 11 sections, 10 equations, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: Talking head synthesis. Existing methods predominantly model lower-face motion through either purely implicit (a) or purely explicit (b) schemes, and consequently suffer either from rigid expressions and weak audio alignment or from motion blur and lip jitter. Our method employs a hybrid explicit-implicit formulation for lower-face motion modeling, achieving anatomy-aware and prosody-aware facial animation synthesis.
  • Figure 2: Overview of HM-Talker. Reference images are semantically decomposed into head images, mouth images, and background images. The head image initializes a static Gaussian field. A Tri-plane Encoder then extracts positional encodings, denoted $\mathcal{H}(\mu)$, from this field. Concurrently, the audio input and the head image are processed by the Cross-Modal Disentanglement Module (CMDM). This module outputs explicit motion features ($\mathbf{c}_{a,l}^{e}$, $\mathbf{c}_{v,l}^{e}$) and implicit motion features ($\mathbf{c}_{a,l}^{i}$, $\mathbf{c}_{a,l}^{i\text{-}mask}$). These features, combined with upper-face features ($\mathbf{c}_{v,u}^{e}$), are fed into the Hybrid Motion Modeling Module (HMMM). The HMMM uses $\mathcal{H}(\mu)$ to compute region-specific attention. It then fuses randomly selected pairs of motion features to generate the lower-face control vector $\mathbf{C}_f$. This vector, together with the upper-face control vector $\mathbf{C}_u$, predicts the deformation $\delta$ applied to the static Gaussian field. Finally, a 3DGS Rasterizer renders the dynamic facial image. This result is alpha-blended with outputs from the Inside Mouth Branch and the background image to produce the audio-driven output.
  • Figure 3: User study. The rating scale ranges from 1 to 5, with higher numbers indicating better performance.
  • Figure 4: t-SNE visualization of motion features over three 20-frame clips. Each row corresponds to one clip; each column represents a different audio input. Here, "Explicit" denotes explicit features predicted from audio.
  • Figure 5: Qualitative image quality comparison. Compared with other methods, our approach achieves the most consistent phoneme-viseme alignment. TGS denotes TalkingGaussian. Please zoom in for better visualization.
  • ...and 3 more figures
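
The final compositing step described in the Figure 2 caption (the rendered face alpha-blended with the Inside Mouth Branch output and the background image) can be illustrated with standard "over" compositing. This is a sketch under stated assumptions rather than the paper's code: the layer order (face over mouth over background) is assumed, the background is treated as opaque, and the function names `over` and `composite_pixel` are hypothetical.

```python
def over(front_rgb, front_alpha, back_rgb):
    """Composite one pixel: a front layer with opacity front_alpha over an opaque back layer."""
    if not 0.0 <= front_alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(
        front_alpha * f + (1.0 - front_alpha) * b
        for f, b in zip(front_rgb, back_rgb)
    )


def composite_pixel(face_rgb, face_alpha, mouth_rgb, mouth_alpha, background_rgb):
    """Assumed layer order: face on top, inside mouth in the middle, background at the back."""
    mouth_over_background = over(mouth_rgb, mouth_alpha, background_rgb)
    return over(face_rgb, face_alpha, mouth_over_background)


if __name__ == "__main__":
    # A half-transparent face pixel over an opaque mouth pixel.
    print(composite_pixel((1.0, 0.0, 0.0), 0.5, (0.0, 1.0, 0.0), 1.0, (0.0, 0.0, 1.0)))
```

Where the face is fully opaque the mouth and background are hidden; where the face is transparent (for example, inside an open mouth) the mouth layer shows through, and the background appears only where both upper layers are transparent.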