EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector

Deok-Hyeon Cho; Hyung-Seok Oh; Seung-Bin Kim; Seong-Whan Lee

TL;DR

EmoSphere++ tackles emotion-controllable zero-shot TTS by introducing an emotion-adaptive spherical vector (EASV) grounded in the VAD space, enabling explicit control of emotion style and intensity without manual labels. A joint attribute style encoder fuses global and dimensional emotion cues with speaker information to support robust zero-shot transfer, while a conditional flow matching (CFM) decoder delivers high-quality expressive synthesis. A normalized orthogonality loss mitigates leakage between speaker and emotion embeddings, enhancing generalization to unseen speakers and emotions. Experimental results on ESD, MSP-Podcast, and IEMOCAP demonstrate superior naturalness, emotion transfer, and controllability compared to baselines, including in zero-shot conditions, with SVAS providing fine-grained emotion evaluation.
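This summary does not spell out how the spherical vector is computed, so the following is only a minimal sketch of the general idea of reading style and intensity off a point in VAD space. It assumes VAD values scaled to [0, 1], a neutral center at (0.5, 0.5, 0.5), and a standard Cartesian-to-spherical conversion. The function names, the center value, and the choice of axes are all hypothetical and are not taken from the paper.

```python
import math

# Assumed neutral point in a [0, 1]-scaled VAD space (hypothetical value).
NEUTRAL_CENTER = (0.5, 0.5, 0.5)


def vad_to_spherical(vad, center=NEUTRAL_CENTER):
    """Map (valence, arousal, dominance) to (r, theta, phi) around `center`.

    Under the assumptions above, r can be read as emotion intensity
    (distance from neutral) and the two angles as emotion style (direction).
    """
    # Shift so the assumed neutral center sits at the origin.
    v, a, d = (x - c for x, c in zip(vad, center))
    r = math.sqrt(v * v + a * a + d * d)
    if r == 0.0:
        # The direction is undefined exactly at the center; return zeros.
        return 0.0, 0.0, 0.0
    # Polar angle measured from the dominance axis, clamped for float safety.
    theta = math.acos(max(-1.0, min(1.0, d / r)))
    # Azimuth in the valence-arousal plane.
    phi = math.atan2(a, v)
    return r, theta, phi


def spherical_to_vad(r, theta, phi, center=NEUTRAL_CENTER):
    """Inverse mapping: rebuild a VAD point from intensity and direction."""
    v = r * math.sin(theta) * math.cos(phi)
    a = r * math.sin(theta) * math.sin(phi)
    d = r * math.cos(theta)
    return tuple(x + c for x, c in zip((v, a, d), center))
```

In this sketch, scaling `r` while holding `theta` and `phi` fixed moves a point toward or away from neutral along the same direction, which is one way to picture intensity control that is independent of style.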

Abstract

Emotional text-to-speech (TTS) technology has achieved significant progress in recent years; however, challenges remain owing to the inherent complexity of emotions and limitations of the available emotional speech datasets and models. Previous studies typically relied on limited emotional speech datasets or required extensive manual annotations, restricting their ability to generalize across different speakers and emotional styles. In this paper, we present EmoSphere++, an emotion-controllable zero-shot TTS model that can control emotional style and intensity to resemble natural human speech. We introduce a novel emotion-adaptive spherical vector that models emotional style and intensity without human annotation. Moreover, we propose a multi-level style encoder that can ensure effective generalization for both seen and unseen speakers. We also introduce additional loss functions to enhance the emotion transfer performance for zero-shot scenarios. We employ a conditional flow matching-based decoder to achieve high-quality and expressive emotional TTS in a few sampling steps. Experimental results demonstrate the effectiveness of the proposed framework.
