Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing

Zhedong Zhang, Liang Li, Chenggang Yan, Chunshan Liu, Anton van den Hengel, Yuankai Qi

TL;DR

The paper tackles movie dubbing, where aligning prosody with visual performance while preserving speaker voice is challenging due to small, noisy datasets. It proposes a two-stage framework: first, prosody-enhanced acoustic pre-training to strengthen acoustic modeling on prosody-rich data; then acoustic-disentangled prosody adapting that freezes the acoustic system and models script prosody and dubbing style via a prosodic text encoder, a prosodic style encoder with diffusion, and in-domain emotion analysis, guided by lip motion for timing. Key contributions include the introduction of a Prosodic Text BERT Encoder, a Prosodic Style Diffusion module, and In-Domain Emotion Analysis within a two-stage training regime, with extensive evaluations on V2C-Animation and GRID showing state-of-the-art results in both objective and subjective metrics. The work improves dubbing quality and robustness to visual-domain shifts, and provides public demos, highlighting practical impact for film post-production and AI-assisted media workflows.

Abstract

Movie dubbing describes the process of transforming a script into speech that aligns temporally and emotionally with a given movie clip while exemplifying the speaker's voice demonstrated in a short reference audio clip. This task demands that the model bridge character performances and complicated prosody structures to build a high-quality video-synchronized dubbing track. The limited scale of movie dubbing datasets, along with the background noise inherent in audio data, hinders the acoustic modeling performance of trained models. To address these issues, we propose an acoustic-prosody disentangled two-stage method to achieve high-quality dubbing generation with precise prosody alignment. First, we propose prosody-enhanced acoustic pre-training to develop robust acoustic modeling capabilities. Then, we freeze the pre-trained acoustic system and design a disentangled framework to model prosodic text features and dubbing style while maintaining acoustic quality. Additionally, we incorporate an in-domain emotion analysis module to reduce the impact of visual domain shifts across different movies, thereby enhancing emotion-prosody alignment. Extensive experiments show that our method performs favorably against the state-of-the-art models on two primary benchmarks. The demos are available at https://zzdoog.github.io/ProDubber/.
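
The two-stage regime described in the abstract (pre-train the acoustic system, then freeze it and train only the prosody modules) can be illustrated with a toy sketch. This is not the authors' implementation: the `Param` and `Module` classes, the `sgd_step` helper, and all weight and gradient values are hypothetical stand-ins chosen only to show that frozen parameters stay fixed during the second stage.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """A scalar weight plus a flag saying whether training may update it."""
    value: float
    trainable: bool = True


@dataclass
class Module:
    """A named group of parameters (hypothetical stand-in for a network)."""
    name: str
    params: list

    def freeze(self):
        # Mark every parameter as non-trainable.
        for p in self.params:
            p.trainable = False


def sgd_step(modules, grads, lr=0.1):
    """Apply one gradient step, skipping frozen parameters."""
    for m in modules:
        for p, g in zip(m.params, grads[m.name]):
            if p.trainable:
                p.value -= lr * g


# Stage I (assumed): pre-train the acoustic system on its own.
acoustic = Module("acoustic", [Param(1.0), Param(-0.5)])
sgd_step([acoustic], {"acoustic": [0.2, -0.4]})  # made-up gradients

# Stage II (assumed): freeze the acoustic system, train only the prosody module.
acoustic.freeze()
prosody = Module("prosody", [Param(0.3)])
before = [p.value for p in acoustic.params]
sgd_step([acoustic, prosody], {"acoustic": [1.0, 1.0], "prosody": [0.5]})
after = [p.value for p in acoustic.params]
```

In this sketch the acoustic weights are identical before and after the second step, while the prosody weight moves, which mirrors the stated goal of adapting prosody without degrading the pre-trained acoustic quality.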

Paper Structure

This paper contains 23 sections, 14 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: (a) Illustration of the V2C task. (b) Illustration of the proposed acoustic pre-training stage (Stage I) and prosody adapting stage (Stage II), which aim to generate dubbing with high acoustic quality and aligned prosody.
  • Figure 2: The main architecture of the proposed method. In the Prosody-Enhanced Acoustic Pre-training stage, we pre-train the acoustic system using a prosody-enhanced text-speech corpus. In the Acoustic-Disentangled Prosody Adapting stage, we freeze the acoustic system and employ a disentangled framework to bridge the prosody of dubbing with the character's performance using In-Domain Emotion Analysis, thus generating dubbing with aligned prosody while maintaining high acoustic quality.
  • Figure 3: The visualization of the mel-spectrograms of ground truth and synthesized dubbing by different models. The red and white bounding boxes highlight regions where different models exhibit significant differences in audio quality and pronunciation details.