Controlling Emotion in Text-to-Speech with Natural Language Prompts
Thomas Bott, Florian Lux, Ngoc Thang Vu
TL;DR
The paper addresses limited controllability in text-to-speech by enabling emotion transfer through natural language prompts. It introduces a prompt-conditioned TTS architecture that fuses emotion prompt embeddings with speaker identity via a squeeze-and-excitation mechanism and conditional layer normalization, trained on merged emotional speech and text data with a large pool of prompts to promote generalization. A two-stage curriculum first builds baseline speech quality and then concentrates on learning robust emotion-prompt mappings that generalize to unseen prompts. Objective metrics show accurate emotion transfer and strong speaker identity preservation, while subjective evaluations indicate competitive naturalness. The approach offers intuitive, scalable control for expressive TTS and is released as open source.
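The fusion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the concatenation of the two embeddings, all dimensions, the randomly initialised weights, and the helper names (`linear`, `random_layer`, `squeeze_excite`, `conditional_layer_norm`) are assumptions chosen only to show how a squeeze-and-excitation gate and a conditional layer normalization could fit together.

```python
import math
import random

def linear(weights, bias, x):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def random_layer(rng, n_out, n_in):
    """Hypothetical randomly initialised layer standing in for trained weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def squeeze_excite(z, reduce_layer, expand_layer):
    """Gate each channel of z with a sigmoid weight computed from z itself."""
    squeezed = [max(0.0, v) for v in linear(*reduce_layer, z)]  # bottleneck + ReLU
    gates = [1.0 / (1.0 + math.exp(-v)) for v in linear(*expand_layer, squeezed)]
    return [v * g for v, g in zip(z, gates)]

def conditional_layer_norm(hidden, cond, scale_layer, shift_layer, eps=1e-5):
    """Normalise hidden, then scale and shift it with values predicted from cond."""
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    normed = [(h - mean) / math.sqrt(var + eps) for h in hidden]
    scale = linear(*scale_layer, cond)
    shift = linear(*shift_layer, cond)
    return [n * s + b for n, s, b in zip(normed, scale, shift)]

rng = random.Random(0)
speaker_emb = [rng.gauss(0, 1) for _ in range(4)]  # toy speaker embedding
prompt_emb = [rng.gauss(0, 1) for _ in range(6)]   # toy emotion-prompt embedding
joint = speaker_emb + prompt_emb                    # assumed fusion input: concatenation

# Squeeze-and-excitation re-weights the channels of the joint representation.
fused = squeeze_excite(joint, random_layer(rng, 3, 10), random_layer(rng, 10, 3))

# The fused vector then conditions the normalisation of one hidden frame.
hidden = [rng.gauss(0, 1) for _ in range(8)]
out = conditional_layer_norm(
    hidden, fused, random_layer(rng, 8, 10), random_layer(rng, 8, 10)
)
```

In this toy setup the gate can only attenuate channels of the joint vector, and the conditioning vector decides the per-channel scale and shift applied after normalization, so the same hidden frame is rendered differently for different speaker and prompt combinations.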
Abstract
In recent years, prompting has quickly become one of the standard ways of steering the outputs of generative machine learning models, due to its intuitive use of natural language. In this work, we propose a system conditioned on embeddings derived from an emotionally rich text that serves as a prompt. A joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture. Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model. Objective and subjective evaluation results demonstrate the ability of the conditioned synthesis system to accurately transfer the emotions present in a prompt to speech. At the same time, precise tractability of speaker identities as well as overall high speech quality and intelligibility are maintained.
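The idea of varying prompts in each training iteration can be sketched as follows. This is an illustration under stated assumptions rather than the authors' data pipeline: the prompt pool, the matching of prompts to speech samples by emotion label, the utterance IDs, and the function names `sample_prompt` and `training_batches` are all hypothetical.

```python
import random

# Hypothetical prompt pool: emotion label -> emotionally rich texts.
PROMPT_POOL = {
    "joy": ["What a wonderful surprise!", "I am so happy to see you."],
    "anger": ["I cannot believe you did that!", "This is completely unacceptable."],
    "sadness": ["I miss them more every day.", "Nothing feels the same anymore."],
}

def sample_prompt(emotion, rng):
    """Pick a prompt at random from the pool for this emotion label."""
    return rng.choice(PROMPT_POOL[emotion])

def training_batches(samples, num_iterations, rng):
    """Yield (utterance_id, prompt) pairs, redrawing the prompt every iteration."""
    for _ in range(num_iterations):
        yield [(utt_id, sample_prompt(emotion, rng)) for utt_id, emotion in samples]

rng = random.Random(0)
samples = [("utt_001", "joy"), ("utt_002", "anger")]  # hypothetical speech samples
for step, batch in enumerate(training_batches(samples, 3, rng)):
    print(step, batch)
```

Because the prompt is redrawn on every pass, the model never sees a fixed one-to-one pairing between an utterance and a prompt text, which is one plausible way to encourage generalization to unseen prompts.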
