DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing

Neha Sahipjohn, Ashishkumar Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah

TL;DR

DubWise tackles lip-sync misalignment in dubbing when the text or language changes by introducing a video-guided, duration-controllable TTS built on a GPT-2-based multilingual backbone. It fuses linguistic tokens, a reference speaker embedding from a voice cloning network, and lip-region video features via cross-modal attention, supplemented by VideoCLIP context, and generates speech autoregressively according to $p(S \mid C_t, F_{lip}) = \prod_{l=1}^{L} p(S_l \mid S_{<l}, C_t, F_{lip})$. A joint loss combining $CE_{audio}$, $CE_{text}$, and $duration\_loss$ guides duration alignment, while training focuses on the cross-attention and transposed-convolution layers; a HiFi-GAN vocoder converts the latent representations to audio. Evaluations on Lip2Wav-Chemistry and LRS2 show improved lip sync, intelligibility, and dubbing quality over state-of-the-art baselines in both non-parallel and cross-lingual scenarios.
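As a rough illustration of how the three loss terms named above could be combined, here is a minimal pure-Python sketch. The summary does not give the loss weights or the exact form of the duration term, so the unit weights (`w_audio`, `w_text`, `w_dur`) and the normalized absolute frame-count difference are assumptions made for illustration only; all function and parameter names are hypothetical.

```python
import math

def cross_entropy(logits, targets):
    """Mean token-level cross-entropy.

    logits: list of per-step score lists; targets: list of class indices.
    """
    total = 0.0
    for step_logits, target in zip(logits, targets):
        m = max(step_logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        total += log_z - step_logits[target]
    return total / len(targets)

def joint_loss(audio_logits, audio_targets, text_logits, text_targets,
               pred_frames, target_frames,
               w_audio=1.0, w_text=1.0, w_dur=1.0):
    """Weighted sum of CE_audio, CE_text, and a duration term.

    Assumption: the duration term is the absolute difference between the
    predicted and target frame counts, normalized by the target length.
    The source does not specify this form or the weights.
    """
    ce_audio = cross_entropy(audio_logits, audio_targets)
    ce_text = cross_entropy(text_logits, text_targets)
    duration_loss = abs(pred_frames - target_frames) / target_frames
    return w_audio * ce_audio + w_text * ce_text + w_dur * duration_loss

# Toy usage: two-class logits for one audio step and one text step,
# with a predicted duration 10% longer than the target.
uniform = [[0.0, 0.0]]
loss = joint_loss(uniform, [0], uniform, [1], pred_frames=110, target_frames=100)
```

With this form, a duration mismatch raises the loss even when both token-level terms are unchanged, which is the general role a duration term plays in steering the output length toward the video.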

Abstract

Audio-visual alignment after dubbing is a challenging research problem. To this end, we propose DubWise, a novel multi-modal Large Language Model (LLM)-based Text-to-Speech (TTS) method that controls the duration of synthesized speech so that it aligns well with the speaker's lip movements in the reference video, even when the spoken text is different or in a different language. To accomplish this, we utilize cross-modal attention techniques in a pre-trained GPT-based TTS. We combine linguistic tokens from text, speaker identity tokens via a voice cloning network, and video tokens via a proposed duration controller network. We demonstrate the effectiveness of our system on the Lip2Wav-Chemistry and LRS2 datasets. The proposed method also achieves improved lip sync and naturalness compared to state-of-the-art methods in both the same-language, different-text (i.e., non-parallel) and the different-language, different-text (i.e., cross-lingual) scenarios.
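To make the cross-modal attention idea concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention in pure Python, where speech-side states act as queries and lip-region video features act as keys and values. It omits learned projections, multiple heads, and everything else specific to the authors' architecture, which the abstract does not describe; the function names and toy dimensions are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: speech-side states, shape (T_speech, d)
    keys, values: lip-video features, shapes (T_video, d) and (T_video, d_v)
    Returns one video-informed vector per speech step, shape (T_speech, d_v).
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this speech step to every video frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the video value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy usage: one speech step attending over two video frames.
attended = cross_attention(
    queries=[[0.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[2.0, 0.0], [0.0, 4.0]],
)
```

Because the output has one row per query, the speech sequence keeps its own length while each step can draw on timing cues from any video frame.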

Paper Structure (7 sections, 4 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Proposed Method: Tokenized reference-speaker audio and text form the model's prompt (ground truth audio included during training only). Lip region video is fed through cross-attention. HiFi-GAN generates speech from the output.
  • Figure 2: Subjective evaluation of the proposed DubWise approach and baseline methods. We obtained a $p$-value $< 0.004$.
  • Figure 3: Spectrographic analysis of the reference English speech and the corresponding translated Hindi speech synthesized using DubWise and different baselines. Here, the reference English sentence is "I have to break six moles of carbon hydrogen bonds, because there's two carbon hydrogen bonds and each acetylene."