PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control
Shaozuo Zhang, Ambuj Mehrish, Yingting Li, Soujanya Poria
TL;DR
This paper tackles expressive speech synthesis by introducing prompt-based emotion control guided by large language models (LLMs) to manipulate prosody while preserving linguistic content. It extends FastSpeech 2 with an Emotion Encoder and an Intensity Encoder, and uses GPT-4 prompting to adjust prosodic features at both utterance and word levels, enabling multi-speaker expressiveness. The approach demonstrates improved emotion classification accuracy and subjective expressiveness without compromising audio quality, as shown by objective metrics and MOS/PIR studies. The results suggest a viable path toward more natural and varied expressive TTS suitable for applications like audiobooks and virtual assistants, with future work in multilinguality, robustness, and real-time deployment.
Abstract
Speech synthesis has advanced significantly from statistical methods to deep neural network architectures, leading to various text-to-speech (TTS) models that closely mimic human speech patterns. However, capturing nuances such as emotion and style in synthesized speech remains challenging. To address this challenge, we introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multiple speakers. Furthermore, we leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content. By embedding emotional cues, regulating intensity levels, and guiding prosodic variations with prompts, our approach infuses synthesized speech with human-like expressiveness and variability. Lastly, we demonstrate the effectiveness of our approach through a systematic exploration of the control mechanisms mentioned above.
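
As a rough illustration of prompt-guided prosody control that leaves the linguistic content untouched, the sketch below applies utterance-level and word-level scale factors to predicted pitch, energy, and duration values. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `WordProsody` and `apply_prosody_control`, the dictionary format of the scale factors, and the multiplicative scheme are all hypothetical. The LLM call itself is omitted, and the factors an LLM might return are hard-coded.

```python
from dataclasses import dataclass


@dataclass
class WordProsody:
    """Predicted prosody for one word (hypothetical container; units are assumed)."""
    word: str
    pitch: float     # assumed: mean F0 in Hz
    energy: float    # assumed: relative energy
    duration: float  # assumed: seconds


def apply_prosody_control(words, utterance_scale, word_scales):
    """Scale prosody with global factors, then per-word factors.

    utterance_scale: dict mapping a feature name to a factor applied to every word.
    word_scales: dict mapping a word index to a dict of feature factors.
    Missing factors default to 1.0, so unspecified features stay unchanged.
    The word text is copied as-is: only prosody is modified.
    """
    adjusted = []
    for index, w in enumerate(words):
        local = word_scales.get(index, {})

        def scaled(feature, value):
            # Combine the utterance-level and word-level factors multiplicatively.
            return value * utterance_scale.get(feature, 1.0) * local.get(feature, 1.0)

        adjusted.append(
            WordProsody(
                word=w.word,
                pitch=scaled("pitch", w.pitch),
                energy=scaled("energy", w.energy),
                duration=scaled("duration", w.duration),
            )
        )
    return adjusted


# Hard-coded stand-ins for factors that a prompted LLM might suggest (assumption).
predicted = [
    WordProsody("I", 200.0, 1.0, 0.10),
    WordProsody("won", 220.0, 1.0, 0.30),
]
utterance_level = {"pitch": 1.1}                 # raise pitch across the utterance
word_level = {1: {"pitch": 1.2, "duration": 1.5}}  # emphasize the second word

for item in apply_prosody_control(predicted, utterance_level, word_level):
    print(item)
```

Indexing word-level factors by position rather than by word string keeps repeated words independently controllable, and defaulting every missing factor to 1.0 means a partial response changes only the features it names.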
