Generating Speakers by Prompting Listener Impressions for Pre-trained Multi-Speaker Text-to-Speech Systems

Zhengyang Chen; Xuechen Liu; Erica Cooper; Junichi Yamagishi; Yanmin Qian

TL;DR

This work tackles prompt-driven control of speaker characteristics in multi-speaker TTS by deriving prompts from listener impressions and decoupling the prompt-to-speaker mapping from the TTS backbone. It introduces a LoRA-tuned prompt encoder to map prompts to speaker embeddings and investigates two mapping strategies: a discriminative approach and a Flow Matching-based generative approach, including a hybrid that combines both. Experiments on the CSJ dataset show that LoRA substantially improves the prompt encoder, Flow Matching yields higher fidelity than discriminative methods, and their combination delivers the best overall performance in both objective (FAD) and subjective (MOS) evaluations. The proposed modular design facilitates integration with various pre-trained multi-speaker TTS systems and provides a data-efficient path to flexible, user-friendly speaker customization in TTS applications.

Abstract

This paper proposes a speech synthesis system that lets users specify and control a speaker's acoustic characteristics through prompts describing the speaker traits of the synthesized speech. Unlike previous approaches, our method constructs prompts from listener impressions, which are easier to collect and align more naturally with everyday descriptions of speaker traits. We adopt Low-Rank Adaptation (LoRA) to quickly adapt a pre-trained language model to this task, facilitating the extraction of speaker-related traits from the prompt text. In addition, unlike other prompt-driven text-to-speech (TTS) systems, we separate the prompt-to-speaker module from the multi-speaker TTS system, which improves flexibility and compatibility with various pre-trained multi-speaker TTS systems. Finally, for the prompt-to-speaker module, we compare a discriminative method with a flow-matching-based generative method and find that combining the two helps the system both capture speaker-related information from prompts more accurately and generate speech with higher fidelity.
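To make the LoRA idea mentioned in the abstract concrete, the following is a minimal sketch of a single low-rank-adapted linear layer in plain Python. It illustrates only the general technique: a frozen weight matrix plus a trainable low-rank update scaled by `alpha / rank`. The class name `LoRALinear`, the helper `matmul`, the rank, the scaling, and the toy 4x4 weight are all assumptions made for illustration; this page does not state which language model, layers, or LoRA hyperparameters the authors use.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical layer: frozen weight W plus a low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen, shape (out_dim, in_dim)
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the update is zero initially.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A) without modifying the frozen weight."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to a single input vector x."""
        column = [[v] for v in x]
        return [row[0] for row in matmul(self.effective_weight(), column)]

    def num_trainable(self):
        """Count the entries of A and B, the only parameters that would be trained."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)


# Toy frozen weight (identity), chosen only to keep the example easy to follow.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(W, rank=1)
x = [1.0, 2.0, 3.0, 4.0]
initial_output = layer.forward(x)
trainable = layer.num_trainable()   # 8 adapter entries versus 16 in the full matrix
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only the entries of `A` and `B` would be updated during tuning. That is what keeps this kind of adaptation inexpensive compared with updating the full weight matrix.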

Paper Structure (22 sections, 4 equations, 7 figures, 3 tables)

Introduction
Prompt-driven Speaker Generation
System Overview
Discriminative Method
Generative Method based on Flow Matching
Flow Matching Algorithm
Generate Speaker Representation based on Flow Matching
Experiment Setup
Dataset and Prompt Construction
Model Configuration
Evaluation Metric
Objective Evaluation
Subjective Evaluation
Results
Audio Fidelity and Naturalness Evaluation
...and 7 more sections

Figures (7)

Figure 1: Overview of our system. $\Tilde{e}$ and $\hat{e}$ are the two types of output of the prompt encoder. Refer to Figure 2 for more details.
Figure 2: The Prompt Encoder Design.
Figure 3: Prompt construction pipeline from listener impressions. The prompt is created using slot-filling techniques, with impression phrases filling the two slots indicated by brackets.
Figure 4: Linear correlation of MOS scores between synthesized speech and reference speech. The Discriminative (w/ LoRA) system is used to generate speech for the seen-speaker scenario.
Figure 5: t-SNE visualization of speaker embeddings for synthesized speech. The Discriminative+Flow-Matching system is used in this figure. For each prompt, 500 utterances are synthesized with the same context text.
...and 2 more figures
