EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control
Haozhe Chen, Run Chen, Julia Hirschberg
TL;DR
EmoKnob addresses the lack of fine-grained emotion control in TTS by using foundation voice cloning models to embed arbitrary emotions at a controllable intensity from few-shot demonstrations. It computes an emotion direction vector $v_e$ from paired neutral and emotional samples and shifts the speaker embedding $u_s$ along it, $u_{s,e} = u_s + \alpha \cdot v_e$, where $\alpha$ sets the emotion intensity. Two strategies extend this to emotions described by open-ended text: a synthetic data-based method and a transcript retrieval-based method. The framework also introduces evaluation metrics for the faithfulness and recognizability of emotion control. Experimental results show that EmoKnob produces faithful, recognizable emotions and surpasses commercial TTS in emotional expressiveness while preserving word error rate (WER) and speaker identity. EmoKnob benefits directly from improvements in foundation speech models and offers a practical route to controlling nuanced emotions such as charisma and empathy from few-shot samples.
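The embedding update stated above can be illustrated with a minimal NumPy sketch. The summary does not say how $v_e$ is aggregated from the pairs, so this sketch assumes it is the mean of the unit-normalized differences between each emotional embedding and its neutral counterpart. The function names `emotion_direction` and `apply_emotion` are hypothetical, and the embeddings are assumed to come from the same voice cloning encoder.

```python
import numpy as np


def emotion_direction(neutral_embs, emotional_embs):
    """Estimate an emotion direction v_e from paired embeddings.

    Assumption (not stated in the summary): v_e is the mean of the
    unit-normalized differences (emotional - neutral) over all pairs.
    Both inputs have shape (n_pairs, dim).
    """
    neutral = np.asarray(neutral_embs, dtype=float)
    emotional = np.asarray(emotional_embs, dtype=float)
    diffs = emotional - neutral
    # Normalize each pair's difference so no single pair dominates the mean.
    norms = np.linalg.norm(diffs, axis=1, keepdims=True)
    diffs = diffs / np.clip(norms, 1e-12, None)
    return diffs.mean(axis=0)


def apply_emotion(speaker_emb, direction, alpha):
    """Return u_{s,e} = u_s + alpha * v_e for a speaker embedding u_s."""
    return np.asarray(speaker_emb, dtype=float) + alpha * direction
```

With `alpha = 0` the speaker embedding is returned unchanged; larger values move it further along the emotion direction, which is what makes the intensity adjustable.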
Abstract
While recent advances in Text-to-Speech (TTS) technology produce natural and expressive speech, current systems do not let users select an emotion or control its intensity. We propose EmoKnob, a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion. Our framework leverages the expressive speaker representation space made possible by recent advances in foundation voice cloning models. Building on the few-shot capability of our emotion control framework, we propose two methods to apply emotion control to emotions described by open-ended text, enabling an intuitive interface for controlling a diverse array of nuanced emotions. To make research on emotional speech synthesis more systematic, we introduce a set of evaluation metrics designed to rigorously assess the faithfulness and recognizability of emotion control frameworks. Through objective and subjective evaluations, we show that our emotion control framework effectively embeds emotions into speech and surpasses the emotional expressiveness of commercial TTS services.
