
Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement

Jianing Yang, Sheng Li, Takahiro Shinozaki, Yuki Saito, Hiroshi Saruwatari

TL;DR

The paper tackles the challenge of expressive, zero-shot emotional TTS by enabling phoneme-level emotion control while disentangling timbre from emotion. It extends FastSpeech 2 with a Style Encoder consisting of a global Timbre Extractor and a phoneme-aware Emotion Extractor, whose outputs are fused via cross-attention to produce $F_{\text{emotion}}$ synchronized to each phoneme and modulated by $F_{\text{timbre}}$; mutual information minimization via MINE, along with explicit emotion and speaker supervision, enforces robust disentanglement. A two-stage training regime first learns a neutral encoder, then jointly trains the Style Encoder with MI-based objectives and extra prosody losses, achieving state-of-the-art style similarity without sacrificing naturalness. Experiments on the ESD dataset show improvements over GST, StyleSpeech, MIST, and DC Comix TTS in MOS, SMOS, MCD, and UAA, with t-SNE visualizations confirming well-separated emotion clusters. The approach advances high-fidelity, controllable emotional TTS and provides a solid foundation for extension to multimodal or diffusion-based backbones in future work.
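The cross-attention fusion described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the array names (`phoneme_hidden`, `ref_frames`, `f_timbre`), the dimensions, the single unprojected attention head, and the exact way the timbre vector is added are all assumptions made for illustration. The sketch only shows how phoneme-side queries attending over reference-speech frames yield one emotion vector per phoneme, which is then combined with a global timbre vector through a residual add and normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: one output row per query (phoneme).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

def layer_norm(x, eps=1e-5):
    # Layer normalization without learned scale or shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
n_phonemes, n_frames, dim = 5, 40, 8                 # hypothetical sizes

phoneme_hidden = rng.normal(size=(n_phonemes, dim))  # stand-in for Phoneme Encoder output
ref_frames = rng.normal(size=(n_frames, dim))        # stand-in for reference-speech features
f_timbre = rng.normal(size=(dim,))                   # stand-in for the global timbre vector

# Phonemes query the reference frames: one emotion embedding per phoneme.
f_emotion, attn = cross_attention(phoneme_hidden, ref_frames, ref_frames)

# Residual add of both style components, then normalization (assumed form).
styled = layer_norm(phoneme_hidden + f_emotion + f_timbre)
print(f_emotion.shape, styled.shape)
```

Because the queries come from the phoneme sequence, the output length follows the text rather than the reference audio, which is what makes the emotion embedding phoneme-synchronous.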

Abstract

Current emotional Text-To-Speech (TTS) and style transfer methods rely on reference encoders to control global style or emotion vectors, but do not capture nuanced acoustic details of the reference speech. To this end, we propose a novel emotional TTS method that enables fine-grained phoneme-level emotion embedding prediction while disentangling intrinsic attributes of the reference speech. The proposed method employs a style disentanglement method to guide two feature extractors, reducing mutual information between timbre and emotion features, and effectively separating distinct style components from the reference speech. Experimental results demonstrate that our method outperforms baseline TTS systems in generating natural and emotionally rich speech. This work highlights the potential of disentangled and fine-grained representations in advancing the quality and flexibility of emotional TTS systems.
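To make the idea of reducing mutual information between timbre and emotion features more concrete, the sketch below evaluates the Donsker-Varadhan lower bound that MINE-style estimators build on, $I(X;Z) \ge \mathbb{E}_{P_{XZ}}[T] - \log \mathbb{E}_{P_X \otimes P_Z}[e^{T}]$. It is a toy under stated assumptions: the features are scalar Gaussians, and `toy_critic` is a hypothetical hand-picked function standing in for the neural statistics network that MINE would train. The paper's actual estimator architecture and loss weighting are not given in this summary.

```python
import numpy as np

def dv_lower_bound(critic, x, z, rng):
    """Donsker-Varadhan lower bound on I(X; Z) for a fixed critic."""
    # Expectation of the critic under the joint distribution (paired samples).
    joint_term = critic(x, z).mean()
    # Shuffling z breaks the pairing, approximating the product of marginals.
    z_shuffled = rng.permutation(z)
    marginal_term = np.log(np.exp(critic(x, z_shuffled)).mean())
    return joint_term - marginal_term

def toy_critic(x, z, scale=0.3):
    # Hypothetical fixed critic; MINE would learn a neural network instead.
    return scale * x * z

rng = np.random.default_rng(0)
n, rho = 5000, 0.9

x = rng.normal(size=n)                                            # stand-in "timbre" feature
z_dependent = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)  # entangled "emotion" feature
z_independent = rng.normal(size=n)                                # disentangled "emotion" feature

mi_dependent = dv_lower_bound(toy_critic, x, z_dependent, rng)
mi_independent = dv_lower_bound(toy_critic, x, z_independent, rng)
print(mi_dependent, mi_independent)
```

In a MINE-based setup, the critic is trained to maximize this bound so that it tracks the true mutual information, while the feature extractors are trained to push the resulting estimate down. The entangled pair should score noticeably higher than the independent pair, which is the signal such a disentanglement objective would penalize.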

Paper Structure

This paper contains 18 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of the proposed TTS model and its main modules. Left: The end-to-end TTS pipeline adopts the FastSpeech 2 backbone; a Style Encoder is inserted after the Phoneme Encoder, followed by the Variance Adaptor and the Mel-spectrogram Decoder. Center: Architecture of the Style Encoder, in which separate Timbre and Emotion Extractors are combined through a residual Add & Norm operation. Right: Detailed design of the Emotion Extractor, which generates phoneme-level emotion embeddings.
  • Figure 2: Proposed architecture for timbre-emotion disentanglement based on mutual information minimization.
  • Figure 3: t-SNE visualization of emotion embeddings.