ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis
Haobin Tang, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang
TL;DR
ED-TTS addresses the limitation of single-scale (utterance-level) emotion control in TTS by introducing multi-scale emotion modeling that fuses utterance-level Speech Emotion Recognition (SER) with frame-level Speech Emotion Diarization (SED). It builds on a diffusion-based TTS framework with a multi-scale style encoder, and uses cross-domain SED to generate frame-level soft emotion labels for supervision, enabling expressive and temporally localized prosody. The approach uses Local Multi-layer Maximum Mean Discrepancy (MMD) to align SED feature distributions across domains, which improves cross-domain SED accuracy and, in turn, the quality of the supervision. Experiments show better audio quality and emotion expressiveness than the baselines, and ablations confirm the importance of both frame-level supervision and cross-domain training. Overall, ED-TTS offers a practical route to more natural and controllable emotional speech synthesis with diffusion models.
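To make the distribution-alignment idea concrete, the following is a minimal sketch of a plain MMD estimate with a Gaussian kernel, summed over several feature layers. It is an illustration under stated assumptions, not the authors' Local Multi-layer MMD: the "local" (class- or subdomain-conditional) weighting is omitted, and the function names, the kernel choice, and the `bandwidth` parameter are all hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors of equal length."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors.

    source, target: non-empty lists of equal-length feature vectors,
    e.g. frame-level features from the source and target domains.
    """
    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


def multilayer_mmd(source_layers, target_layers, bandwidth=1.0):
    """Sum of per-layer MMD estimates (assumed form of 'multi-layer').

    source_layers[i] and target_layers[i] hold the features taken from
    the i-th layer for the source and target domain, respectively.
    """
    return sum(
        mmd_squared(src, tgt, bandwidth)
        for src, tgt in zip(source_layers, target_layers)
    )
```

Used as an auxiliary training loss, a term of this kind would push the features of the two domains toward the same distribution, since the estimate is zero when both sets coincide and grows as they drift apart.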
Abstract
Existing emotional speech synthesis methods often rely on an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with a fine-grained frame-level emotion embedding obtained from SED. These embeddings are used to condition the reverse process of a denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.
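The two ingredients described above, multi-scale conditioning and soft-label supervision, can be sketched roughly as follows. This is a toy illustration, not the ED-TTS implementation: fusing by broadcast addition is an assumption (the abstract does not say how the embeddings are combined), and `fuse_multiscale` and `soft_label_cross_entropy` are hypothetical names.

```python
import math


def fuse_multiscale(utterance_emb, frame_embs):
    """Broadcast one utterance-level embedding over time and add it to
    each frame-level embedding (assumed fusion rule: addition)."""
    if any(len(frame) != len(utterance_emb) for frame in frame_embs):
        raise ValueError("utterance and frame embeddings must share a dimension")
    return [
        [u + f for u, f in zip(utterance_emb, frame)]
        for frame in frame_embs
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_cross_entropy(frame_logits, frame_soft_labels):
    """Mean cross-entropy between predicted per-frame emotion
    distributions and per-frame soft labels (e.g. SED posteriors)."""
    loss = 0.0
    for logits, soft in zip(frame_logits, frame_soft_labels):
        probs = softmax(logits)
        loss -= sum(q * math.log(p) for q, p in zip(soft, probs))
    return loss / len(frame_logits)
```

In this sketch the fused per-frame vectors would serve as the conditioning input to the denoiser at every reverse diffusion step, while the soft-label loss stands in for frame-level supervision when hard frame-level annotations are unavailable.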
