MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling

Yue Zhang; Zhizhou Zhong; Minhao Liu; Zhaokang Chen; Bin Wu; Yubin Zeng; Chao Zhan; Yingjie He; Junxin Huang; Wenjiang Zhou

TL;DR

MuseTalk tackles real-time, high-fidelity video dubbing by training in a latent space with spatio-temporal sampling. It introduces Informative Frame Sampling for stable facial abstraction and Dynamic Margin Sampling to balance lip-sync and teeth detail during adversarial finetuning. The method achieves 30 FPS at 256×256 on a V100 and outperforms state-of-the-art approaches in visual fidelity while maintaining competitive lip-sync accuracy. This approach holds practical potential for multilingual dubbing and expansive virtual content creation.

Abstract

Real-time video dubbing that preserves identity consistency while achieving accurate lip synchronization remains a critical challenge. Existing approaches face a trilemma: diffusion-based methods achieve high visual fidelity but suffer from prohibitive computational costs, while GAN-based solutions sacrifice lip-sync accuracy or dental details for real-time performance. We present MuseTalk, a novel two-stage training framework that resolves this trade-off through latent space optimization and a spatio-temporal data sampling strategy. Our key innovations include: (1) During the Facial Abstract Pretraining stage, we propose Informative Frame Sampling to temporally align reference-source pose pairs, eliminating redundant feature interference while preserving identity cues. (2) In the Lip-Sync Adversarial Finetuning stage, we employ Dynamic Margin Sampling to spatially select the most suitable lip-movement-promoting regions, balancing audio-visual synchronization and dental clarity. (3) MuseTalk establishes an effective audio-visual feature fusion framework in the latent space, delivering 30 FPS output at 256×256 resolution on an NVIDIA V100 GPU. Extensive experiments demonstrate that MuseTalk outperforms state-of-the-art methods in visual fidelity while achieving comparable lip-sync accuracy. The code is available at https://github.com/TMElyralab/MuseTalk.
