Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions
Kun Zhou, You Zhang, Dianwen Ng, Shengkui Zhao, Hao Wang, Bin Ma
TL;DR
This paper tackles the limitation of fixed emotion labels in TTS by introducing a language model-based TTS that uses continuous Pleasure, Arousal, and Dominance ($P$, $A$, $D$) dimensions to control emotional expression. It builds an emotional dimension predictor that maps categorical emotions to PAD anchors via anchored dimensionality reduction and WavLM features, guiding an autoregressive LM-based TTS without requiring explicit emotion labels for training. At inference time, emotions can be cloned from a reference prompt or freely controlled by the user, with a flow-matching module and HiFi-GAN vocoder producing natural, expressive speech. Experiments on LibriTTS and ESD demonstrate improved naturalness and emotional intelligibility, showing broad, fine-grained affective coverage beyond the training data.
Abstract
Emotional text-to-speech (TTS) systems struggle to capture the full spectrum of human emotions due to the inherent complexity of emotional expressions and the limited coverage of existing emotion labels. To address this, we propose a language model-based TTS framework that synthesizes speech across a broad range of emotional styles. Our approach enables flexible user control along three continuous dimensions: pleasure, arousal, and dominance (PAD). To this end, we train an emotional dimension predictor that maps categorical emotion labels in speech datasets into the PAD space, grounded in established psychological research. Importantly, while the emotional dimension predictor leverages categorical labels, the TTS framework itself does not require explicit emotion labels during training. Objective and subjective evaluations demonstrate that our framework effectively generates more expressive emotional styles and enhances both naturalness and diversity compared to baselines.
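
As a rough illustration of what control in a continuous PAD space can look like, the Python sketch below represents a few categorical emotions as anchor points and lets a user blend between them or find the closest category to an arbitrary point. It is not the paper's predictor: the anchor coordinates, the $[-1, 1]$ range, and the helper names `clip_pad`, `blend`, and `nearest_emotion` are all assumptions made for illustration only.

```python
import numpy as np

# Hypothetical PAD anchors (pleasure, arousal, dominance) in [-1, 1].
# These numbers are illustrative placeholders, NOT values from the paper.
PAD_ANCHORS = {
    "neutral": (0.0, 0.0, 0.0),
    "happy": (0.8, 0.5, 0.4),
    "angry": (-0.5, 0.6, 0.3),
    "sad": (-0.6, -0.3, -0.3),
}


def clip_pad(p, a, d):
    """Clamp a user-supplied (P, A, D) triple to the assumed [-1, 1] range."""
    return np.clip(np.array([p, a, d], dtype=float), -1.0, 1.0)


def blend(emotion_a, emotion_b, weight):
    """Linearly interpolate between two anchors; weight=0 gives emotion_a."""
    a = np.array(PAD_ANCHORS[emotion_a], dtype=float)
    b = np.array(PAD_ANCHORS[emotion_b], dtype=float)
    return (1.0 - weight) * a + weight * b


def nearest_emotion(pad):
    """Return the anchor label closest (Euclidean distance) to a PAD point."""
    pad = np.asarray(pad, dtype=float)
    return min(
        PAD_ANCHORS,
        key=lambda name: np.linalg.norm(pad - np.array(PAD_ANCHORS[name])),
    )


if __name__ == "__main__":
    mild_happiness = blend("neutral", "happy", 0.5)
    print(mild_happiness, nearest_emotion(mild_happiness))
```

In this sketch, intermediate points such as `blend("neutral", "happy", 0.5)` stand for styles that no single categorical label names, which is the kind of fine-grained coverage a continuous representation is meant to allow.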
