Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions

Kun Zhou, You Zhang, Dianwen Ng, Shengkui Zhao, Hao Wang, Bin Ma

TL;DR

This paper tackles the limitation of fixed emotion labels in TTS by introducing a language model-based TTS system that uses continuous Pleasure, Arousal, and Dominance ($P$, $A$, $D$) dimensions to control emotional expression. It builds an emotional dimension predictor that maps categorical emotions to PAD anchors via anchored dimensionality reduction over WavLM features; the predictor then guides an autoregressive LM-based TTS model, which itself needs no explicit emotion labels for training. At inference time, emotions can be cloned from a reference prompt or set freely by the user, with a flow-matching module and a HiFi-GAN vocoder producing natural, expressive speech. Experiments on LibriTTS and ESD show improved naturalness and emotional intelligibility, with broad, fine-grained affective coverage that extends beyond the emotions in the training data.
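
The two inference modes described above (cloning from a prompt versus direct user control) can be pictured with a small Python sketch. This is only an illustration, not the authors' implementation: the `PAD` class, the `select_pad` and `interpolate` helpers, the assumed value range of [-1, 1], and all numeric coordinates are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional


def _clip(value: float) -> float:
    # Assumption: each PAD dimension is bounded to [-1, 1].
    return max(-1.0, min(1.0, value))


@dataclass(frozen=True)
class PAD:
    """A point in pleasure-arousal-dominance space (hypothetical container)."""
    pleasure: float
    arousal: float
    dominance: float

    def clipped(self) -> "PAD":
        return PAD(_clip(self.pleasure), _clip(self.arousal), _clip(self.dominance))


def select_pad(predicted_from_prompt: PAD, user_pad: Optional[PAD] = None) -> PAD:
    """Use user-assigned values if given (emotion control),
    otherwise fall back to values predicted from the prompt (emotion cloning)."""
    chosen = user_pad if user_pad is not None else predicted_from_prompt
    return chosen.clipped()


def interpolate(a: PAD, b: PAD, t: float) -> PAD:
    """Linear blend between two PAD points, one way to picture fine-grained control."""
    return PAD(
        a.pleasure + t * (b.pleasure - a.pleasure),
        a.arousal + t * (b.arousal - a.arousal),
        a.dominance + t * (b.dominance - a.dominance),
    ).clipped()


if __name__ == "__main__":
    prompt_pad = PAD(0.2, 0.1, 0.0)      # stand-in for a predictor output
    print(select_pad(prompt_pad))        # cloning: prompt values are used
    print(select_pad(prompt_pad, PAD(0.8, 0.6, 0.3)))  # control: user values win
```

In a real system the selected PAD vector would condition the language model; how that conditioning is implemented is not specified in this summary, so the sketch stops at choosing the values.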

Abstract

Emotional text-to-speech (TTS) systems struggle to capture the full spectrum of human emotions due to the inherent complexity of emotional expressions and the limited coverage of existing emotion labels. To address this, we propose a language model-based TTS framework that synthesizes speech across a broad range of emotional styles. Our approach enables flexible user control along three continuous dimensions: pleasure, arousal, and dominance (PAD). To this end, we train an emotional dimension predictor that maps categorical emotion labels in speech datasets into the PAD space, grounded in established psychological research. Importantly, while the emotional dimension predictor leverages categorical labels, the TTS framework itself does not require explicit emotion labels during training. Objective and subjective evaluations demonstrate that our framework effectively generates more expressive emotional styles and enhances both naturalness and diversity compared to baselines.

Paper Structure (12 sections, 4 figures, 2 tables)

This paper contains 12 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: An overview of the proposed text-to-speech (TTS) framework with emotional dimension control, consisting of: (a) Emotional Dimension (ED) Predictor Training, and (b) Text-to-Speech Flow. The ED predictor is pre-trained on an emotional speech dataset to map emotional features to dimension representations via anchored dimensionality reduction. It then guides the autoregressive language model (LM) to predict acoustic details. 'P', 'A', and 'D' denote 'Pleasure', 'Arousal', and 'Dominance'. PAD values can be either inferred from prompt speech ('Emotion Cloning') or assigned by humans ('Emotion Control').
  • Figure 2: Statistical analysis of pitch and spectral flux for 9 emotions synthesized via ED Control in the proposed framework.
  • Figure 3: Mean Opinion Score for assessing emotional intelligibility ('E-MOS') of our proposed system for the emotion cloning task.
  • Figure 4: XAB test results for four synthesized emotion pairs, used to evaluate the intelligibility of perceived emotions and the effectiveness of emotion control in the proposed system.