Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement
Jianing Yang, Sheng Li, Takahiro Shinozaki, Yuki Saito, Hiroshi Saruwatari
TL;DR
The paper tackles the challenge of expressive, zero-shot emotional TTS by enabling phoneme-level emotion control while disentangling timbre from emotion. It extends FastSpeech 2 with a Style Encoder consisting of a global Timbre Extractor and a phoneme-aware Emotion Extractor, whose outputs are fused via cross-attention to produce $F_{\text{emotion}}$ synchronized to each phoneme and modulated by $F_{\text{timbre}}$; mutual information minimization via MINE, along with explicit emotion and speaker supervision, enforces robust disentanglement. A two-stage training regime first learns a neutral encoder, then jointly trains the Style Encoder with MI-based objectives and extra prosody losses, achieving state-of-the-art style similarity without sacrificing naturalness. Experiments on the ESD dataset show improvements over GST, StyleSpeech, MIST, and DC Comix TTS in MOS, SMOS, MCD, and UAA, with t-SNE visualizations confirming well-separated emotion clusters. The approach advances high-fidelity, controllable emotional TTS and provides a solid foundation for extension to multimodal or diffusion-based backbones in future work.
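To make the MINE-based objective mentioned above more concrete, the following is a minimal sketch of the Donsker-Varadhan lower bound that MINE estimators are built on, $I(X;Y) \ge \mathbb{E}_{p(x,y)}[T] - \log \mathbb{E}_{p(x)p(y)}[e^{T}]$. It is not the paper's implementation: the names `dv_lower_bound` and `mine_estimate`, the toy critic, and the shift-by-one trick for drawing marginal pairs are all assumptions made for illustration. In a real system the critic $T$ would be a trained statistics network scoring (timbre, emotion) feature pairs, and the extractors would presumably be trained to push this estimate down.

```python
import math


def dv_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan bound: mean(T on joint) - log(mean(exp(T on marginals)))."""
    joint_term = sum(joint_scores) / len(joint_scores)
    # Log-mean-exp with the maximum subtracted for numerical stability.
    m = max(marginal_scores)
    mean_exp = sum(math.exp(s - m) for s in marginal_scores) / len(marginal_scores)
    return joint_term - (m + math.log(mean_exp))


def mine_estimate(critic, xs, ys):
    """Estimate the bound from paired samples (xs[i], ys[i]).

    Assumption: marginal pairs are formed by pairing xs[i] with ys[i + 1],
    a simple stand-in for shuffling one side of the batch.
    """
    n = len(xs)
    joint = [critic(xs[i], ys[i]) for i in range(n)]
    marginal = [critic(xs[i], ys[(i + 1) % n]) for i in range(n)]
    return dv_lower_bound(joint, marginal)


# Hypothetical toy critic: scores a scalar pair by its product.
toy_critic = lambda x, y: x * y

# Perfectly correlated toy "features" give a positive estimate.
estimate = mine_estimate(toy_critic, [1.0, -1.0], [1.0, -1.0])
print(estimate)
```

When the critic cannot tell paired from mismatched samples (all scores equal), the bound is zero, which is the state a disentanglement objective aims to reach for timbre and emotion features.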
Abstract
Current emotional Text-To-Speech (TTS) and style transfer methods rely on reference encoders to control global style or emotion vectors, but do not capture nuanced acoustic details of the reference speech. To this end, we propose a novel emotional TTS method that enables fine-grained phoneme-level emotion embedding prediction while disentangling intrinsic attributes of the reference speech. The proposed method employs a style disentanglement method to guide two feature extractors, reducing mutual information between timbre and emotion features, and effectively separating distinct style components from the reference speech. Experimental results demonstrate that our method outperforms baseline TTS systems in generating natural and emotionally rich speech. This work highlights the potential of disentangled and fine-grained representations in advancing the quality and flexibility of emotional TTS systems.
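As an illustration of how a phoneme-level emotion embedding could be obtained from a variable-length reference, here is a minimal sketch of scaled dot-product cross-attention, the fusion mechanism named in the TL;DR. It assumes phoneme hidden states act as queries and frame-level reference features act as keys and values; that role assignment, the function names, and the omission of learned projection matrices and multiple heads are simplifications for illustration, not details confirmed by the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Return one attended vector per query (here: one per phoneme).

    queries: list of phoneme vectors, each of dimension d
    keys:    list of reference-frame vectors, each of dimension d
    values:  list of reference-frame vectors, same length as keys
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this phoneme to every reference frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the frame-level value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Hypothetical toy inputs: 3 phonemes, 4 reference frames, dimension 2.
phoneme_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_features = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.7], [0.0, 0.3]]
phoneme_emotion = cross_attention(phoneme_states, frame_features, frame_features)
print(len(phoneme_emotion), len(phoneme_emotion[0]))
```

The output has one vector per phoneme regardless of how many frames the reference contains, which is what lets the emotion representation stay synchronized with the phoneme sequence.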
