
Hierarchical Emotion Prediction and Control in Text-to-Speech Synthesis

Sho Inoue, Kun Zhou, Shuai Wang, Haizhou Li

TL;DR

The paper addresses emotion control in neural TTS by introducing a hierarchical emotion distribution (ED) that captures emotion intensity at the phoneme, word, and utterance levels. The ED is predicted from text by a BERT-based linguistic encoder and a dedicated ED predictor embedded in a FastSpeech2-style variance adaptor; during training, the predictor is supervised by a Hierarchical ED Extractor operating on ground-truth speech. The approach enables quantifiable, run-time emotion control at multiple linguistic granularities, and the paper reports improvements in speech quality, expressiveness, and controllability on the Blizzard Challenge 2013 and ESD datasets. By linking semantic content to emotional prosody, the work lets users adjust emotion at fine granularity, which is relevant to human–computer interaction and expressive speech synthesis.

Abstract

It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that effectively encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At run-time inference, the TTS model generates emotional speech and, at the same time, provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework in terms of emotion prediction and control.
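
To make the idea of a hierarchical ED more concrete, the following is a minimal sketch of one way such a structure could be stored and edited at run time. It is not the authors' implementation: the class HierarchicalED, the functions scale_intensity and to_phoneme_conditioning, the emotion label set, the assumption that intensities lie in [0, 1], and the choice to concatenate the three levels per phoneme are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, replace

# Hypothetical label set; the paper's actual emotion categories may differ.
EMOTIONS = ("angry", "happy", "sad", "surprise")


@dataclass(frozen=True)
class HierarchicalED:
    phoneme: list    # one row of intensities per phoneme
    word: list       # one row of intensities per word
    utterance: list  # a single row for the whole utterance


def scale_intensity(ed, level, emotion, factor, index=None):
    """Return a copy of `ed` with one emotion scaled at one level.

    Intensities are assumed to lie in [0, 1] and are clipped to that range.
    `index` selects a single phoneme or word; None edits every row.
    """
    col = EMOTIONS.index(emotion)
    if level == "utterance":
        rows = [list(ed.utterance)]
    else:
        rows = [list(r) for r in getattr(ed, level)]
    edit_all = index is None or level == "utterance"
    targets = range(len(rows)) if edit_all else [index]
    for i in targets:
        rows[i][col] = min(1.0, max(0.0, rows[i][col] * factor))
    return replace(ed, **{level: rows[0] if level == "utterance" else rows})


def to_phoneme_conditioning(ed, word_of_phoneme):
    """Attach each phoneme's word-level and utterance-level rows to it."""
    return [
        list(p) + list(ed.word[w]) + list(ed.utterance)
        for p, w in zip(ed.phoneme, word_of_phoneme)
    ]


# Toy utterance: 5 phonemes grouped into 2 words, all intensities at 0.5.
ed = HierarchicalED(
    phoneme=[[0.5] * 4 for _ in range(5)],
    word=[[0.5] * 4 for _ in range(2)],
    utterance=[0.5] * 4,
)
# Raise "happy" on the second word only, leaving the other levels untouched.
edited = scale_intensity(ed, "word", "happy", 1.5, index=1)
cond = to_phoneme_conditioning(edited, [0, 0, 1, 1, 1])
```

In this sketch, the edit touches only the phonemes that belong to the second word, which mirrors the abstract's notion of quantitative control over individual speech constituents. How the real model consumes the ED inside the variance adaptor is not specified in this summary.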

Paper Structure

This paper contains 15 sections, 1 equation, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: System diagrams of (a) Model architecture; (b) Variance adaptor integrating a Hierarchical Emotion Distribution (ED) predictor; (c) Emotion control diagram; (d) An example of emotion control.