PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control

Shaozuo Zhang, Ambuj Mehrish, Yingting Li, Soujanya Poria

TL;DR

This paper tackles expressive speech synthesis by introducing prompt-based emotion control guided by large language models (LLMs) to manipulate prosody while preserving linguistic content. It extends FastSpeech 2 with an Emotion Encoder and an Intensity Encoder, and uses GPT-4 prompting to adjust prosodic features at both the utterance and word levels, enabling expressive synthesis across multiple speakers. The approach demonstrates improved emotion classification accuracy and subjective expressiveness without compromising audio quality, as shown by objective metrics and by mean opinion score (MOS) and Perceptual Intensity Ranking (PIR) studies. The results suggest a viable path toward more natural and varied expressive TTS suitable for applications like audiobooks and virtual assistants, with future work in multilinguality, robustness, and real-time deployment.

Abstract

Speech synthesis has significantly advanced from statistical methods to deep neural network architectures, leading to various text-to-speech (TTS) models that closely mimic human speech patterns. However, capturing nuances such as emotion and style in speech synthesis is challenging. To address this challenge, we introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multiple speakers. Furthermore, we leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content. By embedding emotional cues, regulating intensity levels, and guiding prosodic variations with prompts, our approach infuses synthesized speech with human-like expressiveness and variability. Lastly, we demonstrate the effectiveness of our approach through a systematic exploration of the control mechanisms mentioned above.

Paper Structure (14 sections, 1 equation, 4 figures, 1 table)

Figures (4)

  • Figure 1: Overview of the proposed expressive speech generation framework, composed of $4$ modules: a TTS backbone based on FastSpeech 2 (FS2, yellow-dotted), HuBERT for the Emotion Encoder (red-dotted), HuBERT for the Intensity Encoder (purple-dotted), and GPT-4 prompting for prosody control (green-dotted). FE: Feature Extractor; CLS: Classification Head; REG: Regression Head; LRF: Learned Rank Function.
  • Figure 2: Introduction to Prompt Control: the scaling factors suggested by the LLM (shown in red and blue tables) directly affect the Variance Adaptor.
  • Figure 3: Perceptual Intensity Ranking. H, M, L: High, Medium, and Low intensities.
  • Figure 4: t-SNE of embeddings for the ESD validation set. (a) Intensity embeddings computed using the relative ranking function, $r(\cdot)$. (b) Joint emotion and intensity embeddings.
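
The prompt-control idea in Figure 2 (LLM-suggested scaling factors acting on the Variance Adaptor at the utterance and word levels) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Prosody` container, the `apply_scaling` helper, the dictionary layout of the scaling factors, and the choice to combine utterance-level and word-level factors by multiplication are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Prosody:
    """Hypothetical per-word variance predictions (pitch, energy, duration)."""
    pitch: List[float]
    energy: List[float]
    duration: List[float]


def apply_scaling(
    pred: Prosody,
    utterance_scale: Dict[str, float],
    word_scale: Dict[int, Dict[str, float]],
) -> Prosody:
    """Scale predicted prosody with utterance-level and word-level factors.

    Assumption: the two levels combine multiplicatively, and any factor
    that is not supplied defaults to 1.0 (no change).
    """
    def scaled(name: str, values: List[float]) -> List[float]:
        global_factor = utterance_scale.get(name, 1.0)
        return [
            value * global_factor * word_scale.get(index, {}).get(name, 1.0)
            for index, value in enumerate(values)
        ]

    return Prosody(
        pitch=scaled("pitch", pred.pitch),
        energy=scaled("energy", pred.energy),
        duration=scaled("duration", pred.duration),
    )


# Example: raise pitch for the whole utterance and shorten the second word.
predicted = Prosody(pitch=[100.0, 200.0], energy=[1.0, 1.0], duration=[2.0, 4.0])
adjusted = apply_scaling(
    predicted,
    utterance_scale={"pitch": 1.5},
    word_scale={1: {"duration": 0.5}},
)
print(adjusted)
```

In this sketch the factors would come from parsing the LLM's response; that parsing step is omitted because the response format is not specified in this summary.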