EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

Deok-Hyeon Cho; Hyung-Seok Oh; Seung-Bin Kim; Sang-Hoon Lee; Seong-Whan Lee

TL;DR

EmoSphere-TTS synthesizes expressive emotional speech, using a spherical emotion vector to control the emotional style and intensity of the synthetic speech and a dual conditional adversarial network to improve speech quality by reflecting multi-aspect characteristics.

Abstract

Despite rapid advances in emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, which limits how well nuanced variations of emotion can be reflected. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting multi-aspect characteristics. The experimental results demonstrate the model's ability to control emotional style and intensity while producing high-quality expressive speech.
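The Cartesian-spherical transformation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of dominance as the polar axis, and the `center` argument (a hypothetical reference point, such as a neutral-emotion centroid) are all assumptions made for illustration. The sketch only shows the standard conversion between a 3D point and its radius and two angles.

```python
import math

def cartesian_to_spherical(avd, center=(0.0, 0.0, 0.0)):
    """Map an (arousal, valence, dominance) point to (r, theta, phi).

    `center` is a hypothetical origin; the axis assignment is an assumption.
    """
    a, v, d = (x - c for x, c in zip(avd, center))
    r = math.sqrt(a * a + v * v + d * d)
    if r == 0.0:
        # The angles are undefined at the origin; return zeros by convention.
        return 0.0, 0.0, 0.0
    # Clamp guards against tiny floating-point overshoot outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, d / r)))  # polar angle in [0, pi]
    phi = math.atan2(v, a)                         # azimuth in (-pi, pi]
    return r, theta, phi

def spherical_to_cartesian(r, theta, phi, center=(0.0, 0.0, 0.0)):
    """Inverse mapping back to (arousal, valence, dominance)."""
    a = r * math.sin(theta) * math.cos(phi)
    v = r * math.sin(theta) * math.sin(phi)
    d = r * math.cos(theta)
    return a + center[0], v + center[1], d + center[2]

# Example: convert a point, then shrink its radius while keeping the angles.
r, theta, phi = cartesian_to_spherical((0.3, -0.4, 0.5))
weaker = spherical_to_cartesian(0.5 * r, theta, phi)
```

Under the reading that the radius carries intensity and the two angles carry style (an assumption here, consistent with the abstract's separation of style and intensity), scaling `r` with fixed angles moves the point along a ray from the center without changing its direction.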

Paper Structure (16 sections, 6 equations, 3 figures, 2 tables)

Introduction
EmoSphere-TTS
Emotional style and intensity modeling
AVD encoder
Cartesian-spherical transformation
Spherical emotion encoder
Dual conditional adversarial training
TTS model
Experiments and results
Experimental setup
Implementation details
Model performance
Emotion intensity controllability
Emotion style shift
Conclusion
...and 1 more section

Figures (3)

Figure 1: Overall architecture of EmoSphere-TTS.
Figure 2: The pitch tendency according to emotion and intensity.
Figure 3: A pitch track of a sample demonstrating the effects of emotional style shift, where $A$, $V$, and $D$ represent arousal, valence, and dominance, respectively. The line color represents emotional intensity, red = 0.1, green = 0.5, and blue = 0.9.
