
ED-TTS: Multi-Scale Emotion Modeling using Cross-Domain Emotion Diarization for Emotional Speech Synthesis

Haobin Tang, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang

TL;DR

ED-TTS tackles the limitation of single-scale (utterance-level) emotion control in TTS by introducing multi-scale emotion modeling that fuses utterance-level SER with frame-level SED. It leverages a diffusion-based TTS framework with a multi-scale style encoder and cross-domain SED to generate frame-level soft emotion labels for supervision, enabling expressive and temporally localized prosody. The approach uses Local Multi-layer MMD to align SED distributions across domains, improving cross-domain SED accuracy and supervision quality. Experimental results demonstrate superior audio quality and emotion expressiveness compared with baselines, with ablations confirming the importance of frame-level supervision and cross-domain training. Overall, ED-TTS provides a practical pathway to more natural and controllable emotional speech synthesis using diffusion models.
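To make the distribution-alignment idea concrete, the following is a minimal sketch of a plain maximum mean discrepancy (MMD) estimate between source-domain and target-domain feature batches. It is not the paper's Local Multi-layer MMD, whose details are not given here; the Gaussian kernel, the `bandwidth` parameter, and the function names are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Pairwise Gaussian kernel between rows of x (n, d) and y (m, d)."""
    # Squared Euclidean distances via the expansion |a - b|^2 = |a|^2 + |b|^2 - 2ab.
    sq_dist = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two feature batches.

    Hypothetical helper: in a cross-domain setup, `source` and `target`
    would be features taken from the two domains at some network layer.
    """
    k_ss = gaussian_kernel(source, source, bandwidth).mean()
    k_tt = gaussian_kernel(target, target, bandwidth).mean()
    k_st = gaussian_kernel(source, target, bandwidth).mean()
    return k_ss + k_tt - 2.0 * k_st
```

Used as a training penalty, a term like this is small when the two batches look alike and grows as their distributions drift apart, which is the property a domain-alignment loss relies on.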

Abstract

Existing emotional speech synthesis methods often utilize an utterance-level style embedding extracted from reference audio, neglecting the inherent multi-scale property of speech prosody. We introduce ED-TTS, a multi-scale emotional speech synthesis model that leverages Speech Emotion Diarization (SED) and Speech Emotion Recognition (SER) to model emotions at different levels. Specifically, our proposed approach integrates the utterance-level emotion embedding extracted by SER with fine-grained frame-level emotion embedding obtained from SED. These embeddings are used to condition the reverse process of the denoising diffusion probabilistic model (DDPM). Additionally, we employ cross-domain SED to accurately predict soft labels, addressing the challenge of a scarcity of fine-grained emotion-annotated datasets for supervising emotional TTS training.
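As a rough illustration of the multi-scale idea, the sketch below turns frame-level emotion logits into soft labels and combines one utterance-level embedding with a sequence of frame-level embeddings. The abstract does not say how the two embeddings are fused, so concatenation along the feature axis is an assumption, as are the function names and array shapes.

```python
import numpy as np


def soft_labels(frame_logits):
    """Convert frame-level logits (T, num_classes) to per-frame probabilities."""
    shifted = frame_logits - frame_logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def fuse_multiscale(utt_emb, frame_emb):
    """Combine an utterance-level embedding (D_u,) with frame-level embeddings (T, D_f).

    Assumed fusion: repeat the utterance embedding on every frame and
    concatenate, giving a (T, D_u + D_f) conditioning sequence.
    """
    num_frames = frame_emb.shape[0]
    tiled = np.broadcast_to(utt_emb, (num_frames, utt_emb.shape[0]))
    return np.concatenate([tiled, frame_emb], axis=1)


# Toy shapes only: 5 frames, 2 emotion classes, 3-dim and 4-dim embeddings.
probs = soft_labels(np.zeros((5, 2)))
cond = fuse_multiscale(np.ones(3), np.zeros((5, 4)))
```

In a diffusion-based model, a sequence like `cond` would be passed to the denoising network at each reverse step; that wiring is omitted here.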

Paper Structure (13 sections, 6 equations, 1 figure, 3 tables)


Figures (1)

  • Figure 1: Overview of ED-TTS and cross-domain training for SED. Colors in the waveforms denote the frame-level emotion labels predicted by SED (e.g., red for non-neutral and blue for neutral). "Extracter" denotes the CNN-based feature encoder of SED.