Joint Multi-scale Cross-lingual Speaking Style Transfer with Bidirectional Attention Mechanism for Automatic Dubbing

Jingbei Li; Sipan Li; Ping Chen; Luwen Zhang; Yi Meng; Zhiyong Wu; Helen Meng; Qiao Tian; Yuping Wang; Yuxuan Wang

TL;DR

This work tackles automatic dubbing by addressing the gap in cross-lingual speaking style transfer across both global (utterance-level) and local (word-level) scales. It introduces a joint multitask framework with a shared bidirectional attention mechanism and an MST-FastSpeech 2 backbone to predict and render global style tokens $GST$ and local style sequences $LST$ across languages, enabling synthesis with transferred speaking styles. The approach is pretrained with GRL-based disentanglement and trained with a joint $MSE$ objective on $GST$ and $LST$ across both languages, then evaluated on a Chinese–English game-dubbing corpus, showing significant objective and subjective gains over baselines that transfer only duration or no style. The results demonstrate that modeling both global and local styles, and learning cross-directional transfer, yields more natural and expressive dubbed speech, with practical implications for film/game localization; limitations include corpus size and potential alignment with video dynamics.
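As a rough illustration of the joint $MSE$ objective on $GST$ and $LST$ described above, the sketch below sums one global and one local error term per language. The equal weighting of the terms, the dictionary layout, and the language keys `"zh"` and `"en"` are assumptions made for this example; the summary does not specify them.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def sequence_mse(pred: List[Vector], target: List[Vector]) -> float:
    """MSE averaged over a sequence of word-level style vectors."""
    if len(pred) != len(target):
        raise ValueError("sequences must have the same length")
    return sum(mse(p, t) for p, t in zip(pred, target)) / len(pred)


def joint_style_loss(pred: Dict[str, dict], target: Dict[str, dict]) -> float:
    """Sum the global (GST) and local (LST) MSE terms over both languages.

    Each dictionary maps a language key to {"gst": vector, "lst": list of vectors}.
    Equal term weights are an assumption of this sketch.
    """
    total = 0.0
    for lang in target:
        total += mse(pred[lang]["gst"], target[lang]["gst"])
        total += sequence_mse(pred[lang]["lst"], target[lang]["lst"])
    return total


# Hypothetical toy values, not data from the paper.
target = {
    "zh": {"gst": [0.0, 0.0], "lst": [[0.0], [0.0]]},
    "en": {"gst": [1.0, 1.0], "lst": [[1.0], [1.0], [1.0]]},
}
pred = {
    "zh": {"gst": [1.0, 0.0], "lst": [[0.0], [2.0]]},
    "en": {"gst": [1.0, 1.0], "lst": [[1.0], [1.0], [1.0]]},
}
loss = joint_style_loss(pred, target)
```

Because both directions contribute to a single scalar, one backward pass would update the components shared between the two transfer directions.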

Abstract

Automatic dubbing, which generates a corresponding version of the input speech in another language, could be widely used in real-world scenarios such as video and game localization. In addition to synthesizing the translated scripts, automatic dubbing needs to transfer the speaking style of the original language to the dubbed speech, to give audiences the impression that the characters are speaking in their native tongue. However, state-of-the-art automatic dubbing systems only model the transfer of duration and speaking rate, neglecting other aspects of speaking style such as emotion, intonation and emphasis, which are also crucial for fully portraying the characters and for speech understanding. In this paper, we propose a joint multi-scale cross-lingual speaking style transfer framework to simultaneously model the bidirectional speaking style transfer between languages at both global (i.e. utterance level) and local (i.e. word level) scales. The global and local speaking styles in each language are extracted and used to predict the global and local speaking styles in the other language, with an encoder-decoder framework for each direction and a shared bidirectional attention mechanism for both directions. A multi-scale speaking style enhanced FastSpeech 2 is then used to synthesize speech from the predicted global and local speaking styles for each language. Experimental results demonstrate the effectiveness of our proposed framework, which outperforms a baseline with only duration transfer in both objective and subjective evaluations.
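To make the idea of an attention mechanism shared by both transfer directions more concrete, here is a minimal sketch of one common construction: a single word-to-word score matrix is computed once and normalized along each axis to obtain the weights for each direction. This is an assumption for illustration only; the abstract does not give the exact formulation, and the function names and toy inputs are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def score_matrix(feats_a: Matrix, feats_b: Matrix) -> Matrix:
    """Scaled dot-product scores between every word in A and every word in B."""
    dim = len(feats_a[0])
    return [
        [sum(x * y for x, y in zip(a, b)) / math.sqrt(dim) for b in feats_b]
        for a in feats_a
    ]


def bidirectional_attention(feats_a: Matrix, feats_b: Matrix):
    """Derive attention weights for both directions from one shared score matrix."""
    scores = score_matrix(feats_a, feats_b)            # shape: len(A) x len(B)
    a_over_b = [softmax(row) for row in scores]        # each A word attends over B words
    b_over_a = [softmax(list(col)) for col in zip(*scores)]  # each B word attends over A words
    return a_over_b, b_over_a


def attend(weights: Matrix, values: Matrix) -> Matrix:
    """Weighted sum of value vectors for every query position."""
    width = len(values[0])
    return [
        [sum(w * v[k] for w, v in zip(row, values)) for k in range(width)]
        for row in weights
    ]


# Hypothetical word features: 3 words in language A, 2 words in language B.
feats_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feats_b = [[1.0, 0.0], [0.0, 1.0]]
a_over_b, b_over_a = bidirectional_attention(feats_a, feats_b)

# Local style values of B, re-aligned to the word positions of A.
style_b = [[1.0], [3.0]]
style_b_on_a = attend(a_over_b, style_b)
```

In this construction the two directions cannot disagree about which words correspond to each other, since both sets of weights come from the same scores.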

Paper Structure (28 sections, 11 equations, 7 figures, 5 tables)

Introduction
Related work
Text-to-speech synthesis
Automatic dubbing
Contributions
Data observation
Subjective observation
Objective observation
Methodology
Multimodal multiscale feature extraction
Textual feature extraction
Speaking style extraction
Multimodal feature fusion
Joint multiscale cross-lingual speaking style transfer
Speaking style transfer at the global scale
...and 13 more sections

Figures (7)

Figure 1: Cross-lingual speaking style transfer between two languages at multiple scales.
Figure 2: Architecture of the proposed joint multiscale cross-lingual speaking style transfer framework.
Figure 3: Joint cross-lingual speaking style transfer at the global scale.
Figure 4: Joint cross-lingual speaking style transfer at the local scale.
Figure 5: Bidirectional attention mechanism in local speaking style transfer.
...and 2 more figures
