EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee

TL;DR

EmoSphere++ tackles emotion-controllable zero-shot TTS by introducing an emotion-adaptive spherical vector (EASV) grounded in the VAD space, enabling explicit control of emotion style and intensity without manual labels. A joint attribute style encoder fuses global and dimensional emotion cues with speaker information to support robust zero-shot transfer, while a conditional flow matching (CFM) decoder delivers high-quality expressive synthesis. A normalized orthogonality loss mitigates leakage between speaker and emotion embeddings, enhancing generalization to unseen speakers and emotions. Experimental results on ESD, MSP-Podcast, and IEMOCAP demonstrate superior naturalness, emotion transfer, and controllability compared to baselines, including in zero-shot conditions, with SVAS providing fine-grained emotion evaluation.

Abstract

Emotional text-to-speech (TTS) technology has achieved significant progress in recent years; however, challenges remain owing to the inherent complexity of emotions and limitations of the available emotional speech datasets and models. Previous studies typically relied on limited emotional speech datasets or required extensive manual annotations, restricting their ability to generalize across different speakers and emotional styles. In this paper, we present EmoSphere++, an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech. We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation. Moreover, we propose a multi-level style encoder that can ensure effective generalization for both seen and unseen speakers. We also introduce additional loss functions to enhance the emotion transfer performance for zero-shot scenarios. We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps. Experimental results demonstrate the effectiveness of the proposed framework.

Paper Structure

This paper contains 40 sections, 13 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: (a) Three-dimensional valence-arousal-dominance (VAD) cubes of emotions, where all emotional styles occur as derivative states of primary emotions. Emotional intensity control methods for (b) conventional models and (c) the proposed model, which takes emotional style into account.
  • Figure 2: Illustration of coordinate transformations: (a) Cartesian-to-spherical coordinate transformation [cho24_interspeech] and (b) Emotion-adaptive coordinate transformation. In (a), the center $M$ represents the central coordinates of the neutral state, whereas (b) introduces $M_{k}$, which serves as the representative center that reflects the distributions of both neutral and target emotions.
  • Figure 3: Training diagram of the EmoSphere++ framework. The framework consists of three main modules: the text encoder, the joint attribute style encoder, and the conditional flow matching (CFM) decoder. The right section illustrates the detailed structure of the joint attribute style encoder, which extracts global speaker, global emotion, and dimensional-driven emotion to form a joint attribute style embedding for emotional speech synthesis.
  • Figure 4: Run-time diagram of the proposed EmoSphere++ framework. Emotion style and intensity can be controlled manually via the dimensional-driven emotion. An emotional state is produced as a derivative of the primary emotions by assigning the appropriate angle and length to the spherical vector.
  • Figure 5: Pitch tendency track according to intensity for different emotions. Pitch values were calculated by averaging the synthesized speech for each intensity across all test sentences. Since the intensity of ground truth (GT) speech cannot be adjusted, the GT line represents the pitch tendency based on the emotion-adaptive spherical vector intensity labels across all test sentences, serving as a reference guideline.
  • ...and 5 more figures
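
The Cartesian-to-spherical transformation named in the Figure 2 caption, and the angle-and-length control described for Figure 4, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the angle convention (polar angle measured from the dominance axis), and the center value used in the example are all assumptions made for illustration. The center is simply passed in as a given point; how $M$ or $M_{k}$ is actually computed is not specified in the text above.

```python
import math


def to_spherical(vad, center):
    """Shift a (valence, arousal, dominance) point by `center` and return (r, theta, phi).

    Assumed convention: theta is the polar angle from the third axis,
    phi is the azimuth in the plane of the first two axes.
    """
    x, y, z = (p - c for p, c in zip(vad, center))
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        # The point coincides with the center, so the angles are undefined; return zeros.
        return 0.0, 0.0, 0.0
    # Clamp to guard against tiny floating-point overshoot before acos.
    theta = math.acos(max(-1.0, min(1.0, z / r)))
    phi = math.atan2(y, x)
    return r, theta, phi


def to_cartesian(r, theta, phi, center):
    """Inverse of to_spherical: rebuild the VAD point from length and angles."""
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return tuple(v + c for v, c in zip((x, y, z), center))


# Hypothetical center and VAD point, chosen only for illustration.
center = (0.5, 0.5, 0.5)
r, theta, phi = to_spherical((0.8, 0.7, 0.6), center)

# Keeping the angles fixed and scaling the length moves the point along the same direction.
stronger = to_cartesian(2.0 * r, theta, phi, center)
```

In this sketch the two angles stand in for the direction of the vector and the length `r` for its distance from the center, so scaling `r` alone changes the distance without changing the direction.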