EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis

Haoyu Wang; Chunyu Qiang; Tianrui Wang; Cheng Gong; Qiuyu Liu; Yu Jiang; Xiaobao Wang; Chenyang Wang; Chen Zhang

TL;DR

This paper proposes EmoPro, a two-stage prompt selection strategy designed specifically for emotionally controllable speech synthesis, and shows that prompts selected with it yield more emotionally expressive and engaging synthesized speech.

Abstract

Recent speech synthesis models trained on extensive datasets have demonstrated remarkable zero-shot capabilities. These models can control content, timbre, and emotion in generated speech based on prompt inputs. However, the choice of prompt significantly affects output quality, and most existing selection schemes do not adequately address the control of emotional intensity. To address this gap, this paper proposes EmoPro, a two-stage prompt selection strategy designed specifically for emotionally controllable speech synthesis. The strategy selects highly expressive, high-quality prompts by evaluating them from four perspectives: emotional expression strength, speech quality, text-emotion consistency, and model generation performance. Experimental results show that prompts selected with the proposed method yield more emotionally expressive and engaging synthesized speech than prompts obtained through baseline methods. Audio samples and code will be available at https://whyrrrrun.github.io/EmoPro/.

Paper Structure (19 sections, 2 figures, 5 tables)

Introduction
Method
Overview
Static Selection
Pitch
Perceptual and Textual Selecting
Selecting with Performance under LM-based TTS Method
Dynamic Selection
Experiments
Data
Compared Methods
Test Metrics
Experimental Results
Evaluation on Different TTS Models
Range of Prompt Selection
...and 4 more sections

Figures (2)

Figure 1: The overview of EmoPro. It consists of two stages: a static selection stage and a dynamic selection stage. The static selection stage evaluates the intrinsic quality of the prompt and its performance in the specific LM-based model, while the dynamic selection stage chooses the most relevant prompt from $k$ prompts based on the synthesized text.
Figure 2: Mean and variance of emotional speech pitch: red indicates anger, blue indicates comfort, orange indicates sadness, green indicates happiness, and purple indicates surprise.
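
The two stages described in the Figure 1 caption can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the `Prompt` fields, the unweighted mean used to combine the four static scores, and the cosine similarity between text embeddings used as the "relevance" measure in the dynamic stage are all hypothetical, since this page does not specify how the scores are computed or combined.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Prompt:
    """A candidate prompt with hypothetical, precomputed scores in [0, 1]."""
    text: str
    emotion_strength: float          # emotional expression strength
    speech_quality: float            # perceptual speech quality
    text_emotion_consistency: float  # agreement between text and emotion
    generation_performance: float    # performance inside the LM-based TTS model
    text_embedding: Sequence[float]  # hypothetical embedding of the prompt text


def static_score(p: Prompt) -> float:
    """Combine the four perspectives; an unweighted mean is assumed here."""
    return (
        p.emotion_strength
        + p.speech_quality
        + p.text_emotion_consistency
        + p.generation_performance
    ) / 4.0


def static_select(prompts: List[Prompt], k: int) -> List[Prompt]:
    """Static stage: keep the k prompts with the highest combined score."""
    return sorted(prompts, key=static_score, reverse=True)[:k]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dynamic_select(candidates: List[Prompt], target_embedding: Sequence[float]) -> Prompt:
    """Dynamic stage: pick the candidate most similar to the text to synthesize."""
    return max(candidates, key=lambda p: cosine(p.text_embedding, target_embedding))


if __name__ == "__main__":
    pool = [
        Prompt("I can't believe we won!", 0.9, 0.9, 0.9, 0.9, (1.0, 0.0)),
        Prompt("This is so unfair.", 0.8, 0.8, 0.8, 0.8, (0.0, 1.0)),
        Prompt("Okay.", 0.1, 0.1, 0.1, 0.1, (0.0, 1.0)),
    ]
    top_k = static_select(pool, k=2)
    print(dynamic_select(top_k, target_embedding=(0.1, 0.9)).text)
```

Splitting the work this way means the static scores can be computed once per prompt pool, while only the inexpensive dynamic comparison runs for each new sentence to synthesize.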
