DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance

Junjie Zheng, Zihao Chen, Chaofan Ding, Xinhan Di

TL;DR

DeepDubber addresses adaptive movie dubbing by integrating multimodal chain-of-thought reasoning with a conditioned speech generator. It introduces a two-stage framework in which a multimodal LLM performs in-context CoT reasoning over video and subtitles to infer dubbing type and speaker attributes, followed by a diffusion-based speech generator that renders video-to-speech dubbing under multi-condition controls. A CoT-annotated movie dubbing dataset is provided, and training combines supervised CoT and reinforcement learning with MPO and a loss decomposition including $L_p$, $L_q$, $L_g$, $L_f$, and $L_c$, plus a $\mathcal{L}_{dur}$ term for duration alignment. Empirical results on the V2C-Animation and GRID benchmarks show improved SPK-SIM, EMO-SIM, and WER metrics, supporting the method's lip-sync and prosody alignment and its potential to advance practical dubbing.
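As a point of reference for the WER metric named above, the following is a minimal sketch of the conventional word error rate: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions made for illustration; the summary does not state how the paper transcribes the generated speech or normalizes text.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: lowercase + whitespace tokenization; the evaluation
    protocol actually used in the paper is not described in this summary.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, this quantity can exceed 1.0 when the hypothesis is much longer than the reference.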

Abstract

Current movie dubbing technology can generate the desired voice from a given speech prompt, ensuring good synchronization between speech and visuals while accurately conveying the intended emotions. However, key aspects of movie dubbing, such as adapting to different dubbing styles, handling dialogue, narration, and monologue effectively, and understanding subtle details like the age and gender of speakers, have not been well studied. To address this challenge, we propose a framework built on a multimodal large language model. First, it applies multimodal Chain-of-Thought (CoT) reasoning to visual inputs to understand dubbing styles and fine-grained attributes. Second, it generates high-quality dubbing through large speech generation models, guided by multimodal conditions. Additionally, we have developed a movie dubbing dataset with CoT annotations. The evaluation results demonstrate a performance improvement over state-of-the-art methods across multiple datasets. In particular, SPK-SIM and EMO-SIM increase from 82.48% to 89.74% and from 66.24% to 78.88%, respectively, under dubbing setting 2.0 on the V2C-Animation dataset; LSE-D and MCD-SL decrease from 14.79 to 14.63 and from 5.24 to 4.74, respectively, under dubbing setting 2.0 on the GRID dataset; and SPK-SIM increases from 64.03 to 83.42 while WER decreases from 52.69% to 23.20% under the initial reasoning setting on the proposed CoT-Movie-Dubbing dataset, in comparison with state-of-the-art models.
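To put the reported numbers side by side, the short sketch below computes the absolute and relative change for each before/after pair quoted in the abstract. The figures are copied from the abstract; the dictionary keys and the helper name `change` are illustrative labels, not names from the paper.

```python
# (baseline, proposed) pairs as quoted in the abstract.
reported = {
    "V2C-Animation / SPK-SIM (%)": (82.48, 89.74),
    "V2C-Animation / EMO-SIM (%)": (66.24, 78.88),
    "GRID / LSE-D": (14.79, 14.63),
    "GRID / MCD-SL": (5.24, 4.74),
    "CoT-Movie-Dubbing / SPK-SIM": (64.03, 83.42),
    "CoT-Movie-Dubbing / WER (%)": (52.69, 23.20),
}


def change(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change, relative change in percent of the baseline)."""
    delta = after - before
    return delta, 100.0 * delta / before


changes = {name: change(*pair) for name, pair in reported.items()}

for name, (delta, relative) in changes.items():
    print(f"{name}: {delta:+.2f} absolute, {relative:+.1f}% relative")
```

For SPK-SIM and EMO-SIM higher is better, while for LSE-D, MCD-SL, and WER lower is better, so a negative change on the latter three is an improvement.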

Paper Structure

This paper contains 19 sections, 18 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Current dubbing models [congLearningDubMovies2023, cong2024styledubbermultiscalestylelearning, zhang2024from] (left). Proposed dubbing model (right), which accounts for dubbing types and fine-grained attributes.
  • Figure 2: DeepDubber pipeline with multi-stage, multi-modal training.
  • Figure 3: Proposed dataset with multi-type annotations, including annotations for lips, faces, scene type, speaker gender, speaker age, and voice emotion.
  • Figure 4: The reasoning stages of movie scene type CoT annotations.
  • Figure 5: Visualization of speech samples generated by state-of-the-art models and ours. The green rectangles highlight key regions that differ significantly in overall expressiveness.