VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

Kim Sung-Bin; Jeongsoo Choi; Puyuan Peng; Joon Son Chung; Tae-Hyun Oh; David Harwath

TL;DR

This work addresses automated video dubbing by extending Neural Codec Language Models (NCLMs) to condition speech synthesis on source audio, target text, and target video. It introduces audio-visual adapters and audio-visual fusion layers that embed lip and facial cues into the NCLM token space, enabling time-aligned, expressive speech via a Transformer decoder and an EnCodec vocoder. A new expressive dataset, CelebV-Dub, complements LRS3 for evaluating dubbing under real-world, emotional conditions. Empirical results show that VoiceCraft-Dub achieves better naturalness, intelligibility, and lip synchronization than the baselines and approaches ground-truth quality; a video-to-speech extension demonstrates its versatility. The result is a scalable, multimodal framework for high-fidelity, lip-synced speech synthesis conditioned on visible facial cues, with applications in immersive dubbing and accessibility.

Abstract

We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.
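
The adapter and fusion idea described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the additive fusion rule, the repeat-based upsampling, the function names (`linear`, `upsample`, `fuse`), and all dimensions are assumptions chosen only to show how facial features might be projected into a token embedding space and aligned in time with audio tokens.

```python
import random

def linear(x, weight):
    """Project one feature vector x (length d_in) with weight (d_out rows of d_in)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def upsample(frames, factor):
    """Repeat each video frame `factor` times to match the audio token rate."""
    return [f for f in frames for _ in range(factor)]

def fuse(audio_emb, video_feats, adapter_w, factor):
    """Hypothetical adapter + additive fusion (an assumption, not the paper's layer).

    1. The adapter projects each video feature to the audio embedding size.
    2. Projected features are repeated so both streams share one time axis.
    3. The aligned streams are summed element-wise at every time step.
    """
    projected = [linear(f, adapter_w) for f in video_feats]
    aligned = upsample(projected, factor)[:len(audio_emb)]
    if len(aligned) < len(audio_emb):
        raise ValueError("video is shorter than audio")
    return [[a + v for a, v in zip(a_t, v_t)]
            for a_t, v_t in zip(audio_emb, aligned)]

# Toy sizes (all hypothetical): 5 video frames, 2 audio tokens per frame.
random.seed(0)
d_video, d_model, n_frames, factor = 4, 3, 5, 2
video_feats = [[random.random() for _ in range(d_video)] for _ in range(n_frames)]
audio_emb = [[random.random() for _ in range(d_model)] for _ in range(n_frames * factor)]
adapter_w = [[random.random() for _ in range(d_video)] for _ in range(d_model)]

fused = fuse(audio_emb, video_feats, adapter_w, factor)
print(len(fused), len(fused[0]))  # 10 3
```

In this sketch the fused sequence keeps the length and width of the audio embeddings, so a decoder that consumes audio token embeddings could consume it unchanged. A learned fusion layer would replace the plain sum in practice.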
