MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling

Yue Zhang, Zhizhou Zhong, Minhao Liu, Zhaokang Chen, Bin Wu, Yubin Zeng, Chao Zhan, Yingjie He, Junxin Huang, Wenjiang Zhou

TL;DR

MuseTalk tackles real-time, high-fidelity video dubbing by training in a latent space with spatio-temporal sampling. It introduces Informative Frame Sampling for stable facial abstraction and Dynamic Margin Sampling to balance lip-sync and teeth detail during adversarial finetuning. The method achieves 30 FPS at 256×256 on a V100 and outperforms state-of-the-art approaches in visual fidelity while maintaining competitive lip-sync accuracy. This approach holds practical potential for multilingual dubbing and expansive virtual content creation.

Abstract

Real-time video dubbing that preserves identity consistency while achieving accurate lip synchronization remains a critical challenge. Existing approaches face a trilemma: diffusion-based methods achieve high visual fidelity but suffer from prohibitive computational costs, while GAN-based solutions sacrifice lip-sync accuracy or dental details for real-time performance. We present MuseTalk, a novel two-stage training framework that resolves this trade-off through latent space optimization and a spatio-temporal data sampling strategy. Our key innovations include: (1) During the Facial Abstract Pretraining stage, we propose Informative Frame Sampling to temporally align reference-source pose pairs, eliminating redundant feature interference while preserving identity cues. (2) In the Lip-Sync Adversarial Finetuning stage, we employ Dynamic Margin Sampling to spatially select the most suitable lip-movement-promoting regions, balancing audio-visual synchronization and dental clarity. (3) MuseTalk establishes an effective audio-visual feature fusion framework in the latent space, delivering 30 FPS output at 256×256 resolution on an NVIDIA V100 GPU. Extensive experiments demonstrate that MuseTalk outperforms state-of-the-art methods in visual fidelity while achieving comparable lip-sync accuracy. The code is available at https://github.com/TMElyralab/MuseTalk.

Paper Structure

This paper contains 24 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The difference between talking head generation and video dubbing. Zoom in to see the differences in the lip area. MuseTalk can efficiently generate video frames in one step for the video dubbing task.
  • Figure 2: Illustration of MuseTalk's framework. We first encode a reference facial image and a target image with its lower half occluded into a perceptually equivalent latent space. Subsequently, we employ a multimodal U-Net to effectively fuse audio and visual features at various scales. Consequently, the results decoded from the latent space yield more realistic and lip-synced talking face visual content.
  • Figure 3: Illustration of the proposed Informative Frame Sampling mechanism. We calculate the pose and lip similarity based on the Euclidean distance between facial landmarks.
  • Figure 4: (a) Identity image during inference. (b) First-stage model generates smooth teeth. (c) SyncNet loss promotes accurate lip movements but causes blurring. (d) GAN loss enhances clear teeth but replicates the original lip. (e) After applying DMS, both accurate lip movements and clear teeth are generated.
  • Figure 5: The principle of Dynamic Margin Sampling (DMS) in promoting lip movement learning. Without DMS, the model can easily infer the general lip shape of $I_{\text{gt}}^{t}$ from the relative position of the nose in the input images $I_{\text{ref}}^{t}$ and $I_{s}^{t}$. With DMS, this cue is weakened, forcing the model to learn the lip movements.
  • ...and 1 more figures
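
The Figure 3 caption says that pose and lip similarity are computed from the Euclidean distance between facial landmarks. The sketch below illustrates one way such a score could drive reference-frame selection. It is not the authors' implementation: the frame layout (dicts with "pose" and "lip" keys), the function names, the `top_k` parameter, and the selection rule itself (shortlist the candidates closest in pose, then take the one whose lips differ most from the target) are all assumptions made for illustration.

```python
import math


def landmark_distance(a, b):
    """Mean Euclidean distance between two equally sized lists of (x, y) points."""
    if len(a) != len(b) or not a:
        raise ValueError("landmark sets must be non-empty and equally sized")
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def select_reference(target, candidates, top_k=2):
    """Return the index of the candidate frame chosen as reference.

    Each frame is a dict with hypothetical keys "pose" and "lip", each holding
    a list of (x, y) landmark coordinates. Assumed rule (not from the paper
    text above): keep the `top_k` candidates closest to the target in pose,
    then pick the one whose lip landmarks differ most from the target.
    """
    if not candidates:
        raise ValueError("need at least one candidate frame")
    # Rank candidate indices by pose distance to the target (smallest first).
    by_pose = sorted(
        range(len(candidates)),
        key=lambda i: landmark_distance(target["pose"], candidates[i]["pose"]),
    )
    shortlist = by_pose[:top_k]
    # Among pose-aligned candidates, prefer the most dissimilar lips.
    return max(
        shortlist,
        key=lambda i: landmark_distance(target["lip"], candidates[i]["lip"]),
    )


# Toy data: two pose landmarks and two lip landmarks per frame.
target = {"pose": [(0.0, 0.0), (1.0, 0.0)], "lip": [(0.5, 1.0), (0.5, 1.2)]}
candidates = [
    {"pose": [(0.0, 0.0), (1.0, 0.0)], "lip": [(0.5, 1.0), (0.5, 1.2)]},  # same pose, same lips
    {"pose": [(0.1, 0.0), (1.1, 0.0)], "lip": [(0.5, 1.0), (0.5, 1.6)]},  # similar pose, wider mouth
    {"pose": [(3.0, 2.0), (4.0, 2.0)], "lip": [(0.5, 1.0), (0.5, 2.0)]},  # very different pose
]
chosen = select_reference(target, candidates)
```

With this toy data, the third candidate is dropped for its pose mismatch and the second is chosen over the first because its lips differ from the target. A real pipeline would take landmarks from a face-landmark detector and would likely normalize them for face scale before comparing distances.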