Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition

Rendi Chevi, Alham Fikri Aji

TL;DR

Daisy-TTS tackles the limited expressivity of emotional speech synthesis by leveraging a structural model of emotion and learning emotionally-separable prosody embeddings. The approach uses a Grad-TTS backbone with a prosody encoder and an emotion discriminator to create embeddings that can be decomposed and manipulated to produce primary and secondary emotions, as well as intensity and polarity control. Empirical evaluations with MOS and emotion perceivability demonstrate improved naturalness and recognizability over a baseline, and ablations reveal the critical role of emotion separability and the discriminator. This work advances expressive TTS by enabling a wider, controllable spectrum of emotion through prosody-based representations, with implications for more natural and adaptable synthetic speech while acknowledging ethical considerations and language-generalization limitations.

Abstract

We often verbally express emotions in a multifaceted manner: they may vary in intensity and may be expressed not just as a single emotion but as a mixture of emotions. This wide spectrum of emotions is well-studied in the structural model of emotions, which represents a variety of emotions as derivative products of primary emotions with varying degrees of intensity. In this paper, we propose an emotional text-to-speech design to simulate a wider spectrum of emotions grounded in the structural model. Our proposed design, Daisy-TTS, incorporates a prosody encoder to learn emotionally-separable prosody embeddings as a proxy for emotion. This emotion representation allows the model to simulate: (1) primary emotions, as learned from the training samples, (2) secondary emotions, as a mixture of primary emotions, (3) intensity level, by scaling the emotion embedding, and (4) emotion polarity, by negating the emotion embedding. Through a series of perceptual evaluations, Daisy-TTS demonstrated overall higher emotional speech naturalness and emotion perceivability compared to the baseline.
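
The three embedding manipulations named in the abstract (mixing, scaling, negating) can be illustrated with plain vector arithmetic. The following is a minimal sketch, not the authors' implementation: the embeddings are random stand-ins rather than outputs of a trained prosody encoder, the emotion labels, the embedding size of 8, and the helper names `mix`, `scale`, and `negate` are all hypothetical, and the convex-combination mixing rule is an assumption, since the abstract does not specify how the mixture is computed.

```python
import numpy as np


def unit(v):
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)


# Stand-in primary-emotion embeddings (assumption: random unit vectors).
# In the actual system these would come from the learned prosody encoder.
rng = np.random.default_rng(0)
primary = {name: unit(rng.normal(size=8)) for name in ("joy", "sadness")}


def mix(a, b, weight=0.5):
    """Secondary emotion as a convex combination of two primary embeddings.

    The mixing rule is an assumption made for illustration only.
    """
    return weight * a + (1.0 - weight) * b


def scale(e, intensity):
    """Intensity control: multiply the emotion embedding by a scalar."""
    return intensity * e


def negate(e):
    """Polarity control: flip the sign of the emotion embedding."""
    return -e


secondary = mix(primary["joy"], primary["sadness"])  # mixture of two primaries
stronger_joy = scale(primary["joy"], 2.0)            # higher intensity
opposite_joy = negate(primary["joy"])                # opposite polarity
```

In this sketch, scaling changes only the length of the embedding and negation only its direction, which is why intensity and polarity can be varied independently of which emotion is selected.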

Paper Structure (35 sections, 6 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Emotionally-separable prosody embeddings learned from our proposed model, Daisy-TTS. Emotions bordered in black denote primary emotions, while ones bordered in white denote secondary emotions derived from the mixture of primary ones.
  • Figure 2: Visual Representation of the Structural Model of Emotions.
  • Figure 3: System overview of Daisy-TTS. Emotionally-separable prosody embeddings were learned from a set of speech features and used to condition a TTS backbone model. To simulate a wider range of emotion characteristics, such as intensity, polarity, and mixture of emotions, an embedding decomposition was applied to the learned embeddings.
  • Figure 4: Results of the emotion perception test for different intensity levels of primary emotions.
  • Figure 5: (Left Column) Prosody embeddings from training Daisy-TTS without the emotion discriminator; (Right Column) with the emotion discriminator. Training without the emotion discriminator discouraged emotion separability.
  • ...and 2 more figures