EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control

Haozhe Chen, Run Chen, Julia Hirschberg

TL;DR

EmoKnob addresses the lack of fine-grained emotion control in TTS by leveraging foundation voice cloning models to embed arbitrary emotions with a controllable intensity using a few-shot demonstration. It computes an emotion direction vector $v_e$ from paired neutral and emotional samples and applies $u_{s,e} = u_s + \alpha \cdot v_e$ to the speaker embedding, enabling open-ended emotion control via two strategies: a synthetic data-based method and a transcript retrieval-based method. The framework introduces rigorous evaluation metrics for faithfulness and recognizability, and experimental results show that EmoKnob achieves faithful, recognizable emotions and surpasses commercial TTS in emotion expressiveness while preserving $WER$ and speaker identity. EmoKnob demonstrates strong synergy with evolving foundation speech models and offers practical routes to control nuanced emotions like charisma and empathy with few-shot samples.
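The update $u_{s,e} = u_s + \alpha \cdot v_e$ can be illustrated with a minimal sketch. The summary above only states that $v_e$ is computed from paired neutral and emotional samples; averaging unit-normalized differences over the pairs is an assumption made here for illustration, and the function names `emotion_direction` and `apply_emotion` are hypothetical. Embeddings are plain lists of floats standing in for the output of a voice cloning model's speaker encoder.

```python
import math
from typing import List, Sequence

Vector = List[float]


def emotion_direction(
    neutral: Sequence[Sequence[float]],
    emotional: Sequence[Sequence[float]],
) -> Vector:
    """Estimate v_e from paired neutral/emotional speaker embeddings.

    Assumption (not stated in the summary): each pairwise difference is
    scaled to unit length before the pairs are averaged.
    """
    if not neutral or len(neutral) != len(emotional):
        raise ValueError("need the same non-zero number of neutral and emotional samples")
    dim = len(neutral[0])
    total = [0.0] * dim
    for u_n, u_e in zip(neutral, emotional):
        diff = [e - n for e, n in zip(u_e, u_n)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            raise ValueError("a neutral/emotional pair has identical embeddings")
        for k in range(dim):
            total[k] += diff[k] / norm
    return [t / len(neutral) for t in total]


def apply_emotion(u_s: Sequence[float], v_e: Sequence[float], alpha: float) -> Vector:
    """Shift the speaker embedding: u_{s,e} = u_s + alpha * v_e."""
    return [s + alpha * v for s, v in zip(u_s, v_e)]


# Toy 2-D example with two neutral/emotional pairs (two-shot).
v_e = emotion_direction([[0.0, 0.0], [1.0, 1.0]], [[3.0, 4.0], [1.0, 3.0]])
u_se = apply_emotion([1.0, 1.0], v_e, alpha=2.0)
```

Setting `alpha` to 0 returns the original speaker embedding unchanged, and larger values move it further along the emotion direction, which is the "knob" for intensity.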

Abstract

While recent advances in Text-to-Speech (TTS) technology produce natural and expressive speech, they lack the option for users to select emotion and control intensity. We propose EmoKnob, a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion. Our framework leverages the expressive speaker representation space made possible by recent advances in foundation voice cloning models. Based on the few-shot capability of our emotion control framework, we propose two methods to apply emotion control on emotions described by open-ended text, enabling an intuitive interface for controlling a diverse array of nuanced emotions. To facilitate a more systematic emotional speech synthesis field, we introduce a set of evaluation metrics designed to rigorously assess the faithfulness and recognizability of emotion control frameworks. Through objective and subjective evaluations, we show that our emotion control framework effectively embeds emotions into speech and surpasses emotion expressiveness of commercial TTS services.

Paper Structure (22 sections, 2 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Fine-grained emotion control with EmoKnob. While existing TTS and voice cloning frameworks lack the option for users to control emotions in speech, our framework allows users to embed arbitrary emotion with a specified intensity in speech with few-shot samples. This framework allows us to propose two methods for controlling emotions based on open-ended text descriptions of emotions.
  • Figure 2: EmoKnob's few-shot emotion control pipeline. EmoKnob first extracts an emotion direction vector in the speaker embedding space of pre-trained foundation voice cloning models from a pair of neutral and emotional samples. Then, EmoKnob manipulates the reference speaker's embedding with the obtained emotion direction vector and a specified emotion strength to embed the emotion into speech.
  • Figure 3: EmoKnob enables emotion control with open-ended text descriptions of emotion. Based on recent advances in LLMs and EmoKnob's capability of applying emotion control with few-shot samples, we propose two methods that bypass the data insufficiency problem in emotional speech and embed emotions described by open-ended text into speech.
  • Figure 4: Ablation results measuring SIM and WER with varying shot number and emotion strength.
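
The two quantities tracked in Figure 4 can be sketched as follows. WER is the standard word-level edit distance divided by the reference length. This page does not define SIM; treating it as cosine similarity between speaker embeddings of the reference and generated audio is an assumption here, and both function names are hypothetical.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used here as an assumed stand-in for SIM."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

In an ablation of this kind, one would transcribe each synthesized utterance with a speech recognizer, score it against the input text with `word_error_rate`, and compare speaker embeddings before and after emotion control with `cosine_similarity`, repeating for each shot count and emotion strength.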