VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

Kim Sung-Bin, Jeongsoo Choi, Puyuan Peng, Joon Son Chung, Tae-Hyun Oh, David Harwath

TL;DR

This work addresses automated video dubbing by extending Neural Codec Language Models (NCLMs) to condition speech synthesis on source audio, target text, and a target video. It introduces audio-visual adapters and AV fusion layers that embed lip and facial cues into the NCLM token space, enabling time-aligned, expressive speech via a Transformer decoder and Encodec vocoder. A new expressive dataset, CelebV-Dub, complements LRS3 to evaluate dubbing under real-world, emotional conditions. Empirical results show VoiceCraft-Dub achieves superior naturalness, intelligibility, and lip synchronization compared with baselines and approaches ground-truth quality, and its versatility is demonstrated through a video-to-speech extension. This work advances immersive dubbing and accessibility, offering a scalable, multimodal framework for high-fidelity, lip-synced speech synthesis conditioned on visible facial cues.

Abstract

We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.
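The adapter and fusion idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the feature dimensions, the frame rates, the nearest-frame upsampling, and the concatenate-then-project fusion are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_adapter(features, weight, bias):
    """Project visual features (T, d_in) into the model dimension (T, d_model)."""
    return features @ weight + bias

def upsample_to_audio_rate(visual, num_audio_steps):
    """Repeat video-rate tokens so every audio step has a visual token (nearest earlier frame)."""
    num_frames = visual.shape[0]
    idx = np.arange(num_audio_steps) * num_frames // num_audio_steps
    return visual[np.minimum(idx, num_frames - 1)]

def fuse(audio_emb, lip_tok, face_tok, w_fuse):
    """Assumed fusion: concatenate along the feature axis, then project back to d_model."""
    stacked = np.concatenate([audio_emb, lip_tok, face_tok], axis=-1)
    return stacked @ w_fuse

# Hypothetical sizes: 1 s of 25 fps video and 50 codec steps for the same second.
d_lip, d_face, d_model = 512, 256, 64
num_frames, num_audio_steps = 25, 50

# Random stand-ins for lip features, face features, and audio token embeddings.
lip_feats = rng.normal(size=(num_frames, d_lip))
face_feats = rng.normal(size=(num_frames, d_face))
audio_emb = rng.normal(size=(num_audio_steps, d_model))

# Adapters map each visual stream into the token space of the language model.
lip_tok = linear_adapter(lip_feats, rng.normal(size=(d_lip, d_model)) * 0.01, np.zeros(d_model))
face_tok = linear_adapter(face_feats, rng.normal(size=(d_face, d_model)) * 0.01, np.zeros(d_model))

# Align the visual tokens to the audio timeline, then merge the three streams.
lip_tok = upsample_to_audio_rate(lip_tok, num_audio_steps)
face_tok = upsample_to_audio_rate(face_tok, num_audio_steps)
fused = fuse(audio_emb, lip_tok, face_tok, rng.normal(size=(3 * d_model, d_model)) * 0.01)

print(fused.shape)
```

The fused sequence has one vector per audio step, so a decoder consuming it sees lip and face information at the same temporal resolution as the speech tokens. In a trained model the adapter and fusion weights would be learned rather than sampled at random.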

Paper Structure

This paper contains 54 sections, 6 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Automated video dubbing. (a) Unlike text-to-speech, which generates diverse speech based on target text, automated video dubbing requires synthesized speech to be temporally and expressively aligned with the video while maintaining naturalness and intelligibility. (b) Examples of synthesized speech from VoiceCraft-Dub show that each utterance is aligned with the lip movements of the input video. We strongly encourage listening to the synthesized samples at https://voicecraft-dub.github.io/.
  • Figure 2: Our proposed approach. (a) The Transformer decoder autoregressively generates audio tokens from phonemized text tokens, source speech tokens (extracted via the Encodec encoder), and audio-visual fused tokens, which combine the target speech token with lip and facial tokens from the target video. The numbers in each token denote the timestep. (b) An audio-visual fusion layer aligns the generated target speech tokens with the preceding lip and face tokens, effectively merging their information for better synchronization. (c) Finally, the generated target speech tokens are decoded by Encodec to synthesize high-quality speech that is temporally aligned with the video.
  • Figure 3: Qualitative results. We compare mel-spectrogram visualizations from ground truth recordings, our model, and prior methods (HPMDubbing, StyleDubber) on LRS3 (columns 1–2) and CelebV-Dub (columns 3–4). The texts below each mel-spectrogram represent time-aligned speech extracted using Whisper, with red text indicating incorrectly synthesized speech.
  • Figure S1: Correlation between human evaluation and lip-sync objective metric. We visualize the scatter plot showing the relationship between the objective lip-sync metric (LSE-D) and the subjective Mean Opinion Score (MOS) on lip-sync from human evaluation. We observe a weak correlation between the two, with a correlation coefficient of 0.36, indicating that LSE-D should be used as a reference rather than a definitive metric.
  • Figure S2: Qualitative results. We compare mel-spectrogram visualizations from ground-truth recordings, our model, and existing methods (HPMDubbing, StyleDubber) on the LRS3 (columns 1–2) and our CelebV-Dub (columns 3–4) datasets. The texts below each mel-spectrogram represent time-aligned speech extracted using Whisper, with red text indicating incorrectly synthesized speech.
  • ...and 2 more figures
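
The correlation analysis described for Figure S1 can be reproduced in outline with a Pearson coefficient between per-sample LSE-D values and lip-sync MOS ratings. The sketch below assumes a Pearson coefficient (the caption does not name the type), and the numbers are toy placeholders, not data from the paper; the function name `pearson_r` is hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Toy placeholder values for illustration only (NOT results from the paper).
lse_d = [7.1, 8.4, 6.9, 9.2, 7.8, 8.0]   # objective lip-sync distance per sample
mos = [3.9, 3.6, 4.2, 3.8, 3.1, 4.0]     # subjective lip-sync rating per sample

r = pearson_r(lse_d, mos)
print(f"Pearson r = {r:.2f}")
```

A coefficient with small magnitude means the objective metric explains little of the variation in human ratings, which is the reason the caption gives for treating LSE-D as a reference rather than a definitive metric.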