Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction
Yuan Zhao, Rui Liu, Gaoxiang Cong
TL;DR
Automatic video dubbing still struggles to express nuanced prosody because existing methods underuse multiscale multimodal context and its interaction with the current sentence. The authors propose M2CI-Dubber, a context-interaction framework with two shared M2CI encoders that extracts global and local context features from video, text, and audio, and fuses them with the current sentence via an interaction-based graph attention network and a Context-Aware Adaptor within an HPM-Dubbing backbone. Its three components, Multiscale Feature Extraction, Interaction-based Multiscale Aggregation, and Interaction-based Multimodal Fusion, capture cross-modal prosody cues and their dynamic influence on dubbing expressiveness. On the Chem dataset, the model is reported to outperform baselines on GPE, FFE, MOS-C, MOS-S, and AV sync. By modeling structured multiscale multimodal context interactions, the approach aims to better align prosody with lip motion and facial emotion, with practical implications for more natural dubbed media.
Abstract
Automatic Video Dubbing (AVD) generates speech aligned with lip motion and facial emotion from scripts. Recent research focuses on modeling multimodal context to enhance prosody expressiveness but overlooks two key issues: 1) Multiscale prosody expression attributes in the context influence the current sentence's prosody. 2) Prosody cues in context interact with the current sentence, impacting the final prosody expressiveness. To tackle these challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction scheme for AVD. This scheme includes two shared M2CI encoders to model the multiscale multimodal context and facilitate its deep interaction with the current sentence. By extracting global and local features for each modality in the context, utilizing attention-based mechanisms for aggregation and interaction, and employing an interaction-based graph attention network for fusion, the proposed approach enhances the prosody expressiveness of synthesized speech for the current sentence. Experiments on the Chem dataset show our model outperforms baselines in dubbing expressiveness. The code and demos are available at https://github.com/AI-S2-Lab/M2CI-Dubber.
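To make the idea of attention-based aggregation between the current sentence and its context concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`attend`, `aggregate_multiscale`), the toy 2-dimensional feature vectors, and the choice of single-head scaled dot-product attention followed by concatenation are all assumptions made for illustration. The current sentence feature acts as the query, and it pools over global context features and local context features separately.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the weighted sum of `values` and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return pooled, weights


def aggregate_multiscale(current, global_ctx, local_ctx):
    """Hypothetical aggregation: the current sentence queries each scale.

    The two pooled vectors are concatenated; the paper's actual
    aggregation and fusion modules may differ.
    """
    g_pooled, g_weights = attend(current, global_ctx, global_ctx)
    l_pooled, l_weights = attend(current, local_ctx, local_ctx)
    return g_pooled + l_pooled, g_weights, l_weights


# Toy features (assumed values, not from the Chem dataset).
current = [1.0, 0.0]                                # current sentence
global_ctx = [[1.0, 0.0], [0.0, 1.0]]               # one vector per context sentence
local_ctx = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]    # finer-grained context units

fused, g_w, l_w = aggregate_multiscale(current, global_ctx, local_ctx)
print(len(fused), g_w, l_w)
```

In this toy setup, context items that are more similar to the current sentence receive larger attention weights, which is the intuition behind letting context prosody cues interact with the current sentence rather than being averaged uniformly.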
