SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model

Kaidi Wang, Yi He, Wenhao Guan, Weijie Wu, Hongwu Ding, Xiong Zhang, Di Wu, Meng Meng, Jian Luan, Lin Li, Qingyang Hong

TL;DR

SyncVoice tackles the problem of lip-synced, natural speech generation for video dubbing, including cross-lingual translation where lip movements may not align with the target speech. It builds on a pretrained flow-based TTS (ZipVoice) and augments it with a Text-Visual Fusion module and a Dual Speaker Encoder to leverage visual cues and mitigate inter-language interference. A multi-condition classifier-free guidance strategy enables fine-grained control over facial action, lip motion, and text in generation, and a diffusion-based training objective underpins robust audiovisual synchronization. Experiments on monolingual GRID and bilingual datasets demonstrate strong lip-sync, naturalness, and cross-lingual robustness, with ablations confirming the value of visual conditioning and the dual-encoder design.
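The multi-condition classifier-free guidance mentioned above can be illustrated with a generic sketch. The summary does not give the paper's guidance formula, so the nested composition below, the function name `multi_condition_cfg`, the weights `w_text`, `w_lip`, `w_face`, and the order in which conditions are added are all assumptions made for illustration. They show one common way to combine several conditional predictions, not SyncVoice's actual method.

```python
from typing import List, Sequence


def multi_condition_cfg(
    v_uncond: Sequence[float],    # prediction with all conditions dropped
    v_text: Sequence[float],      # prediction conditioned on text only
    v_text_lip: Sequence[float],  # prediction conditioned on text + lip motion
    v_full: Sequence[float],      # prediction conditioned on text + lip + face
    w_text: float = 1.0,          # hypothetical guidance weight for text
    w_lip: float = 1.0,           # hypothetical guidance weight for lip motion
    w_face: float = 1.0,          # hypothetical guidance weight for facial action
) -> List[float]:
    """Combine predictions with one guidance weight per condition.

    Each term is the change in the prediction caused by adding one more
    condition, scaled by that condition's weight (assumed nesting order:
    text, then lip motion, then facial action).
    """
    return [
        u + w_text * (t - u) + w_lip * (tl - t) + w_face * (f - tl)
        for u, t, tl, f in zip(v_uncond, v_text, v_text_lip, v_full)
    ]


# Toy two-dimensional predictions, chosen only to make the arithmetic visible.
u = [0.0, 0.0]
t = [1.0, 0.0]
tl = [1.0, 2.0]
f = [3.0, 2.0]

same_as_full = multi_condition_cfg(u, t, tl, f)             # all weights 1
stronger_lip = multi_condition_cfg(u, t, tl, f, w_lip=2.0)  # emphasize lip motion
```

With every weight set to 1 the terms telescope and the result equals the fully conditioned prediction. Raising a single weight amplifies only that condition's contribution, which is the kind of per-condition control the summary describes.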

Abstract

Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audiovisual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis and explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.

Paper Structure

This paper contains 17 sections, 2 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: The main architecture of the proposed method.
  • Figure 2: Detail of Text-Visual Fusion module.