EMOCONV-DIFF: Diffusion-based Speech Emotion Conversion for Non-parallel and In-the-wild Data

Navin Raj Prabhu, Bunlong Lay, Simon Welker, Nale Lehmann-Willenbrock, Timo Gerkmann

TL;DR

The paper tackles speech emotion conversion (SEC) in non-parallel, in-the-wild data by representing emotion with a continuous arousal dimension and using a diffusion-based model, EmoConv-Diff. It integrates three encoders (phoneme, speaker, emotion) and a diffusion decoder to disentangle lexical content, identity, and emotion, enabling target-arousal conditioning without parallel utterances. Training relies on score matching and mel-spectrogram reconstruction losses, with a Tweedie-based approximation of the source spectrogram guiding learning; inference uses a target emotion embedding derived from reference samples. Evaluated on MSP-Podcast v1.10, EmoConv-Diff achieves competitive SEC performance and robustness to extreme arousal values, demonstrating effective intensity-controlled emotion transfer in real-world data. This approach advances SEC by removing the parallel-data requirement and improving extreme-emotion handling, with practical implications for natural, emotionally expressive speech in HCI applications.
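The Tweedie-based approximation mentioned above can be illustrated on a toy problem. The following is a minimal sketch, not the paper's implementation: it assumes a generic perturbation of the form x_t = alpha * x_0 + sigma * eps (the paper's exact SDE parameterization may differ), works on scalars rather than mel-spectrograms, and replaces the learned score network with the exact score of a Gaussian toy distribution. All function and parameter names are hypothetical.

```python
def tweedie_denoise(x_t, score, alpha, sigma):
    """Tweedie's formula: estimate E[x_0 | x_t] for x_t = alpha * x_0 + sigma * eps.

    `score` is the gradient of log p(x_t) evaluated at x_t. In a diffusion
    model it would come from the trained score network; here it is passed in.
    """
    return (x_t + sigma ** 2 * score) / alpha


def gaussian_marginal_score(x_t, mu0, s0, alpha, sigma):
    """Exact score of p(x_t) when x_0 ~ N(mu0, s0^2) (assumed toy prior)."""
    var_t = alpha ** 2 * s0 ** 2 + sigma ** 2  # variance of the noisy marginal
    return -(x_t - alpha * mu0) / var_t


def gaussian_posterior_mean(x_t, mu0, s0, alpha, sigma):
    """Closed-form E[x_0 | x_t] for the same toy model, used as a reference."""
    var_t = alpha ** 2 * s0 ** 2 + sigma ** 2
    return mu0 + alpha * s0 ** 2 * (x_t - alpha * mu0) / var_t


if __name__ == "__main__":
    # Arbitrary illustrative values, not taken from the paper.
    mu0, s0, alpha, sigma = 1.5, 0.8, 0.6, 0.9
    for x_t in (-2.0, 0.0, 0.7, 3.0):
        score = gaussian_marginal_score(x_t, mu0, s0, alpha, sigma)
        estimate = tweedie_denoise(x_t, score, alpha, sigma)
        reference = gaussian_posterior_mean(x_t, mu0, s0, alpha, sigma)
        print(f"x_t={x_t:+.2f}  tweedie={estimate:+.6f}  reference={reference:+.6f}")
```

With an exact score, the Tweedie estimate coincides with the true posterior mean. With a learned score it is only an approximation of the clean signal, which is presumably why the summary describes it as an approximation of the source spectrogram.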

Abstract

Speech emotion conversion is the task of converting the expressed emotion of a spoken utterance to a target emotion while preserving the lexical content and speaker identity. While most existing works in speech emotion conversion rely on acted-out datasets and parallel data samples, in this work we specifically focus on more challenging in-the-wild scenarios and do not rely on parallel data. To this end, we propose a diffusion-based generative model for speech emotion conversion, the EmoConv-Diff, that is trained to reconstruct an input utterance while also conditioning on its emotion. Subsequently, at inference, a target emotion embedding is employed to convert the emotion of the input utterance to the given target emotion. As opposed to performing emotion conversion on categorical representations, we use a continuous arousal dimension to represent emotions while also achieving intensity control. We validate the proposed methodology on a large in-the-wild dataset, the MSP-Podcast v1.10. Our results show that the proposed diffusion model is indeed capable of synthesizing speech with a controllable target emotion. Crucially, the proposed approach shows improved performance along the extreme values of arousal and thereby addresses a common challenge in the speech emotion conversion literature.

Paper Structure

This paper contains 9 sections, 10 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Illustration of the training and inference process of the proposed EmoConv-Diff approach. Dotted arrows denote operations performed only during training. The stop-gradient function blocks gradients from flowing back through its inputs during training.
  • Figure 2: Sample log-energy spectrogram of emotion-converted speech, along with a comparison of pitch contours.
  • Figure 3: Class-wise $L_{mse}$ performance for target arousal $\bar{e}$ and ground-truth arousal $e$.