Controlling Emotion in Text-to-Speech with Natural Language Prompts

Thomas Bott, Florian Lux, Ngoc Thang Vu

TL;DR

The paper addresses the controllability limitation in text-to-speech by enabling emotion transfer through natural language prompts. It introduces a prompt-conditioned TTS architecture that fuses emotion prompt embeddings with speaker identity via a squeeze-and-excitation mechanism and conditional layernorm, trained on merged emotional speech and text data with a large pool of prompts to promote generalization. A two-stage curriculum training regime first builds baseline speech quality and then concentrates on learning robust emotion-prompt mappings to generalize to unseen prompts. Objective metrics show accurate emotion transfer and strong speaker identity preservation, while subjective evaluations indicate competitive naturalness; the approach offers intuitive, scalable control for expressive TTS and is released as open-source.
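
The fusion described above can be illustrated with a small sketch. It is not the authors' implementation: the dimensions, the untrained random weights, the helper names (`linear`, `init_linear`, `se_fuse`, `conditional_layer_norm`) and the exact gating arrangement are assumptions made for illustration. The sketch concatenates a speaker and a prompt embedding, rescales the joint vector with a squeeze-and-excitation-style sigmoid gate, and uses the result to predict the scale and shift of a layer normalization.

```python
import math
import random

def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def init_linear(n_in, n_out, rng):
    """Random, untrained weights (hypothetical, for illustration only)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def se_fuse(speaker, prompt, down, up):
    """Concatenate both embeddings and rescale each channel with a gate."""
    joint = speaker + prompt                              # list concatenation
    hidden = [max(0.0, h) for h in linear(joint, *down)]  # bottleneck + ReLU
    gate = [sigmoid(g) for g in linear(hidden, *up)]      # per-channel gate in (0, 1)
    return [j * g for j, g in zip(joint, gate)]

def conditional_layer_norm(x, cond, to_scale, to_shift, eps=1e-5):
    """Normalize x, then apply a scale and shift predicted from the condition."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    scale = linear(cond, *to_scale)
    shift = linear(cond, *to_shift)
    return [s * n + b for s, n, b in zip(scale, normed, shift)]

rng = random.Random(0)
speaker = [rng.gauss(0, 1) for _ in range(4)]       # toy speaker embedding
prompt = [rng.gauss(0, 1) for _ in range(6)]        # toy prompt embedding
hidden_state = [rng.gauss(0, 1) for _ in range(8)]  # toy encoder state

down = init_linear(10, 3, rng)
up = init_linear(3, 10, rng)
to_scale = init_linear(10, 8, rng)
to_shift = init_linear(10, 8, rng)

cond = se_fuse(speaker, prompt, down, up)
out = conditional_layer_norm(hidden_state, cond, to_scale, to_shift)
```

Because the gate lies strictly between 0 and 1, it can only attenuate channels of the joint embedding; in a trained model the gate weights would decide which speaker and prompt dimensions influence the normalization.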

Abstract

In recent years, prompting has quickly become one of the standard ways of steering the outputs of generative machine learning models, due to its intuitive use of natural language. In this work, we propose a system conditioned on embeddings derived from an emotionally rich text that serves as a prompt. A joint representation of speaker and prompt embeddings is integrated at several points within a transformer-based architecture. Our approach is trained on merged emotional speech and text datasets and varies prompts in each training iteration to increase the generalization capabilities of the model. Objective and subjective evaluation results demonstrate the ability of the conditioned synthesis system to accurately transfer the emotions present in a prompt to speech. At the same time, precise tractability of speaker identities as well as overall high speech quality and intelligibility are maintained.
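
Varying the prompt between training iterations can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' data pipeline: it assumes every speech sample carries an emotion label and that prompts are grouped by the same labels. The pool contents and the names `PROMPT_POOL`, `sample_prompt` and `training_examples` are hypothetical.

```python
import random

# Hypothetical prompt pool keyed by emotion label; the sentences are made up.
PROMPT_POOL = {
    "joy": ["What a wonderful day!", "I am so happy for you."],
    "sadness": ["I miss her so much.", "Nothing went right today."],
    "neutral": ["That's ok.", "The meeting is at noon."],
}

def sample_prompt(emotion, rng):
    """Pick a random prompt whose emotion matches the utterance's label."""
    return rng.choice(PROMPT_POOL[emotion])

def training_examples(utterances, epochs, rng):
    """Yield (text, emotion, prompt) triples, resampling the prompt on every pass."""
    for _ in range(epochs):
        for text, emotion in utterances:
            yield text, emotion, sample_prompt(emotion, rng)

rng = random.Random(0)
utterances = [("Hello there.", "joy"), ("See you later.", "sadness")]
triples = list(training_examples(utterances, epochs=3, rng=rng))
```

Since the same utterance is paired with different prompts over time, the model cannot memorize a fixed prompt-to-utterance mapping and has to rely on the emotional content of the prompt embedding.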

Paper Structure

This paper contains 16 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Spectrograms with pitch contour for the same text, synthesized by our proposed system given two different emotional prompts. On the left the underlying emotion is neutral ("That's ok.") and on the right it is surprise ("Oh, really?").
  • Figure 2: Architecture of the prompt-conditioned TTS system. Green components handle the integration of speaker and prompt embeddings. $+$ indicates concatenation. The loss functions used to optimize the components of the system are marked in orange.
  • Figure 3: Results of speech emotion recognition, shown as the relative frequency of predicted emotion labels versus the underlying ones. For Prompt Conditioned Same, the input text is the same as the prompt, while Prompt Conditioned Other uses a prompt that differs from the input text. Emotion labels are abbreviated as follows: a(nger), j(oy), n(eutral), sa(dness), su(rprise).
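
A relative-frequency confusion matrix of the kind described in the Figure 3 caption could be computed as in the sketch below. It assumes that frequencies are normalized per underlying label, which the caption does not state explicitly. The function name `relative_confusion` is hypothetical, and the example labels are made up, not results from the paper.

```python
from collections import Counter

LABELS = ["a", "j", "n", "sa", "su"]  # abbreviations from the Figure 3 caption

def relative_confusion(true_labels, predicted_labels, labels=LABELS):
    """Return matrix[true][pred]: share of samples with label `true` predicted as `pred`."""
    counts = Counter(zip(true_labels, predicted_labels))
    totals = Counter(true_labels)
    return {
        t: {p: (counts[(t, p)] / totals[t] if totals[t] else 0.0) for p in labels}
        for t in labels
    }

# Made-up labels for illustration only.
true_labels = ["a", "a", "j", "j", "n", "sa", "su", "su"]
predicted = ["a", "sa", "j", "j", "n", "sa", "su", "j"]
matrix = relative_confusion(true_labels, predicted)
```

Each row of `matrix` sums to 1 for every underlying label that occurs in the data, so the diagonal entries can be read directly as per-emotion recognition rates.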