HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis
Shiyu Liu, Kui Jiang, Xianming Liu, Hongxun Yao, Xiaocheng Feng
TL;DR
HM-Talker tackles lip-sync and motion artifacts in audio-driven talking head synthesis by introducing a hybrid motion model that fuses explicit anatomical priors (Action Units) with implicit audio-driven cues. The Cross-Modal Disentanglement Module (CMDM) aligns audio and visual representations and enables AU supervision under audio-driven conditions, while the Hybrid Motion Modeling Module (HMMM) uses gated attention and stochastic feature pairing to achieve identity-agnostic generalization. Built atop a TalkingGaussian-inspired 3D Gaussian Splatting backbone, the approach achieves high visual fidelity and precise lip articulation, outperforming NeRF- and 3DGS-based baselines while maintaining real-time rendering. The results on multi-identity data show strong lip-sync accuracy and robust cross-subject generalization, indicating practical potential for personalized yet scalable talking head synthesis.
Abstract
Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily because they model audio-facial motion correlations only implicitly and therefore lack explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation that combines implicit and explicit motion cues. The explicit cues are Action Units (AUs), anatomically defined facial muscle movements, which are used alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit and explicit motion features and predicts AUs directly from audio input that is aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit and explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker's superiority over state-of-the-art methods in visual quality and lip-sync accuracy.
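To make the idea of merging randomly paired implicit and explicit features more concrete, the following is a minimal NumPy sketch of one possible reading, not the HMMM as implemented in the paper. It assumes that both feature sets share the same shape, that "random pairing" means shuffling the explicit (AU) features across the batch, and that a plain per-channel sigmoid gate can stand in for the gated attention mentioned above. The function names (`random_pairing`, `gated_fusion`), the gate parameters (`w_gate`, `b_gate`), and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def random_pairing(explicit, rng):
    """Shuffle explicit (AU) features across the batch.

    Assumption: "random pairing" is modeled as a batch permutation, so each
    implicit feature is matched with an explicit feature from a random sample.
    """
    perm = rng.permutation(explicit.shape[0])
    return explicit[perm], perm


def gated_fusion(implicit, explicit, w_gate, b_gate):
    """Blend two feature sets with a per-channel sigmoid gate.

    Assumption: a simple sigmoid gate replaces the paper's gated attention.
    gate  = sigmoid(concat(implicit, explicit) @ w_gate + b_gate)
    fused = gate * implicit + (1 - gate) * explicit
    """
    joint = np.concatenate([implicit, explicit], axis=-1)    # (B, 2D)
    gate = 1.0 / (1.0 + np.exp(-(joint @ w_gate + b_gate)))  # (B, D)
    fused = gate * implicit + (1.0 - gate) * explicit        # (B, D)
    return fused, gate


# Hypothetical sizes: a batch of 4 samples with 8-dimensional features.
rng = np.random.default_rng(0)
B, D = 4, 8
implicit = rng.normal(size=(B, D))   # stand-in for implicit audio features
explicit = rng.normal(size=(B, D))   # stand-in for explicit AU features

# Untrained gate parameters; in practice these would be learned.
w_gate = rng.normal(scale=0.1, size=(2 * D, D))
b_gate = np.zeros(D)

paired, perm = random_pairing(explicit, rng)
fused, gate = gated_fusion(implicit, paired, w_gate, b_gate)
print(fused.shape, gate.min(), gate.max())
```

Because the gate lies strictly between 0 and 1, each fused channel is a convex combination of the implicit value and the randomly paired explicit value. The intuition in this sketch is that shuffling the pairing discourages the fusion from relying on sample-specific co-occurrences between the two streams.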
