Towards Authentic Movie Dubbing with Retrieve-Augmented Director-Actor Interaction Learning

Rui Liu, Yuan Zhao, Zhenqi Jia

TL;DR

Authentic-Dubber is a new Retrieve-Augmented Director-Actor Interaction Learning scheme for authentic movie dubbing. Its three novel mechanisms let it faithfully replicate the authentic dubbing workflow, achieving comprehensive improvements in emotional expressiveness.

Abstract

Automatic movie dubbing models generate vivid speech from given scripts, replicating a speaker's timbre from a brief timbre prompt while ensuring lip-sync with the silent video. Existing approaches simulate a simplified workflow in which actors dub directly without preparation, overlooking the critical director-actor interaction. In contrast, authentic workflows involve a dynamic collaboration: directors actively engage with actors, guiding them to internalize context cues, specifically emotion, before the performance. To address this issue, we propose a new Retrieve-Augmented Director-Actor Interaction Learning scheme for authentic movie dubbing, termed Authentic-Dubber, which contains three novel mechanisms: (1) We construct a multimodal Reference Footage library to simulate the learning footage provided by directors, integrating Large Language Models (LLMs) to achieve deep comprehension of emotional representations across multimodal signals. (2) To emulate how actors efficiently and comprehensively internalize director-provided footage during dubbing, we propose an Emotion-Similarity-based Retrieval-Augmentation strategy, which retrieves the multimodal information most relevant to the target silent video. (3) We develop a Progressive Graph-based speech generation approach that incrementally incorporates the retrieved multimodal emotional knowledge, thereby simulating the actor's final dubbing process. These mechanisms enable Authentic-Dubber to faithfully replicate the authentic dubbing workflow, achieving comprehensive improvements in emotional expressiveness. Both subjective and objective evaluations on the V2C Animation benchmark dataset validate its effectiveness. The code and demos are available at https://github.com/AI-S2-Lab/Authentic-Dubber.
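
As a rough illustration of the retrieval idea in mechanism (2), the following minimal sketch ranks reference clips by how similar their emotion embedding is to that of the target clip and keeps the Top-K. It is not the authors' implementation: the use of cosine similarity, the function names `cosine_similarity` and `retrieve_top_k`, the tuple layout of the library, and the toy two-dimensional embeddings are all assumptions made for illustration. The optional `speaker` filter is a guess at how the Speaker-Agnostic and Speaker-Specific retrieval settings might differ.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Hypothetical library entry: (clip_id, speaker_id, emotion_embedding).
Entry = Tuple[str, str, Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float],
    library: List[Entry],
    k: int,
    speaker: Optional[str] = None,
) -> List[Tuple[str, float]]:
    """Return the k (clip_id, similarity) pairs most similar to the query.

    speaker=None searches the whole library (assumed speaker-agnostic setting);
    passing a speaker id restricts the search to that speaker's clips
    (assumed speaker-specific setting).
    """
    candidates = [e for e in library if speaker is None or e[1] == speaker]
    scored = [(clip_id, cosine_similarity(query, emb)) for clip_id, _, emb in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data, invented for this example only.
library: List[Entry] = [
    ("clip_a", "spk1", [1.0, 0.0]),
    ("clip_b", "spk2", [0.9, 0.1]),
    ("clip_c", "spk1", [0.0, 1.0]),
]
query = [1.0, 0.0]

agnostic = retrieve_top_k(query, library, k=2)
specific = retrieve_top_k(query, library, k=2, speaker="spk1")
```

In this toy setup, the unrestricted search can pick up a close match from another speaker, while the speaker-restricted search only ever returns clips from the requested speaker, even when they are emotionally dissimilar to the query.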

Paper Structure

This paper contains 23 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1: (a) Previous models rely solely on cross-modal modeling of the target utterance to generate speech, which results in limited emotional expressiveness. (b) Our method enables expressive dubbing through three mechanisms: Multimodal Reference Footage Construction, Emotion-Similarity-based Retrieval-Augmentation, and Progressive Graph-based Speech Generation.
  • Figure 2: The proposed Authentic-Dubber consists of Multimodal Reference Footage Construction, Emotion-Similarity-based Retrieval-Augmentation, and Progressive Graph-based Speech Generation. (* means that the node's initial vector representation is initialized from the immediately preceding graph.)
  • Figure 3: Mel-spectrograms of ground-truth (GT) speech and of speech synthesized by different dubbing baselines; orange bounding boxes highlight details in the speech.
  • Figure 4: EMO-ACC of speech generated by Authentic-Dubber under Speaker-Agnostic and Speaker-Specific retrieval settings with varying Top-$K$ values.
  • Figure 5: Emotion accuracy (EMO-ACC) of speech generated by the proposed Authentic-Dubber under different retrieval dataset scales.
  • ...and 1 more figure