DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing
Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah
TL;DR
DubWise tackles lip-sync misalignment in dubbing when the text or language changes by introducing a video-guided, duration-controllable TTS built on a GPT-2-based multilingual backbone. It fuses linguistic tokens, a reference speaker embedding from a voice cloning network, and lip-region video features via cross-modal attention, supplemented by VideoCLIP context, to condition speech generation as $p(S \mid C_t, F_{lip}) = \prod_{l=1}^{L} p(S_l \mid S_{<l}, C_t, F_{lip})$. A joint loss combining $CE_{audio}$, $CE_{text}$, and $duration\_loss$ guides duration alignment, while training focuses on the cross-attention and transposed-convolution layers; a HiFi-GAN vocoder converts the latent representations to audio. Evaluations on Lip2Wav-Chemistry and LRS2 show improved lip sync, intelligibility, and cross-lingual dubbing quality over state-of-the-art baselines in non-parallel and cross-lingual scenarios, underscoring the practical impact of video-guided duration control for multimodal TTS.
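The autoregressive factorization and the joint loss summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the unit loss weights, and the L1 form of the duration term are assumptions made for illustration, since this summary does not state how the three terms are weighted or how the duration loss is defined.

```python
import math

def sequence_log_prob(step_probs, targets):
    """Chain-rule log-likelihood: log p(S) = sum_l log p(S_l | S_<l, C_t, F_lip).

    step_probs[l] is the model's distribution over the token vocabulary at
    step l, already conditioned on the previous tokens, the text context,
    and the lip features. targets[l] is the index of the token at step l.
    """
    return sum(math.log(probs[t]) for probs, t in zip(step_probs, targets))

def cross_entropy(step_probs, targets):
    """Mean negative log-likelihood per token (the CE terms in the joint loss)."""
    return -sequence_log_prob(step_probs, targets) / len(targets)

def duration_l1(predicted_frames, target_frames):
    """Hypothetical duration term: absolute gap between the predicted speech
    length and the length implied by the video. The L1 form is an assumption."""
    return abs(predicted_frames - target_frames)

def joint_loss(ce_audio, ce_text, duration_loss,
               w_audio=1.0, w_text=1.0, w_duration=1.0):
    """Weighted sum of the three terms. The weights are assumed, not sourced."""
    return w_audio * ce_audio + w_text * ce_text + w_duration * duration_loss

if __name__ == "__main__":
    # Two decoding steps over a toy two-token vocabulary.
    probs = [[0.5, 0.5], [0.25, 0.75]]
    tokens = [0, 1]
    ce = cross_entropy(probs, tokens)
    print(joint_loss(ce, ce, duration_l1(98, 100)))
```

Because the product over steps becomes a sum of log-probabilities, the cross-entropy terms are simply the negated, length-normalized log-likelihood of the target token sequence.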
Abstract
Audio-visual alignment after dubbing is a challenging research problem. To this end, we propose DubWise, a novel multimodal Large Language Model (LLM)-based Text-to-Speech (TTS) method that controls the duration of synthesized speech so that it aligns with the speaker's lip movements in the reference video, even when the spoken text is different or in a different language. To accomplish this, we apply cross-modal attention in a pre-trained GPT-based TTS. We combine linguistic tokens from text, speaker identity tokens from a voice cloning network, and video tokens from a proposed duration controller network. We demonstrate the effectiveness of our system on the Lip2Wav-Chemistry and LRS2 datasets. The proposed method achieves improved lip sync and naturalness compared with state-of-the-art methods in both the same-language, different-text (i.e., non-parallel) and the different-language, different-text (i.e., cross-lingual) scenarios.
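As a rough illustration of the cross-modal attention mentioned above, the following Python sketch lets each text-side token attend over a sequence of video tokens using scaled dot-product attention. It is a simplified assumption of how such a layer operates, not the DubWise architecture itself: the learned query, key, and value projections and the multi-head structure are omitted, and all names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: one vector per text-side token (hypothetical linguistic tokens).
    keys, values: one vector per video token (hypothetical lip features).
    Returns one video-informed vector per query.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every video token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the video value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

if __name__ == "__main__":
    text_tokens = [[0.0, 0.0], [1.0, 0.0]]
    lip_keys = [[1.0, 0.0], [0.0, 1.0]]
    lip_values = [[2.0, 0.0], [0.0, 4.0]]
    for row in cross_attention(text_tokens, lip_keys, lip_values):
        print(row)
```

The output has one vector per query, so the text-side sequence keeps its length while each position absorbs information from the video frames it attends to most strongly.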
