EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion

Ashishkumar Gudmalwar, Ishan D. Biyani, Nirmesh Shah, Pankaj Wasnik, Rajiv Ratn Shah

TL;DR

This paper proposes regularizing emotion intensity in a diffusion-based emotional voice conversion (EVC) framework to generate precise speech of the target emotion, presented as the first work of its kind.

Abstract

Emotional Voice Conversion (EVC) aims to convert the discrete emotional state of a given speech utterance from the source emotion to the target emotion while preserving its linguistic content. In this paper, we propose regularizing emotion intensity in a diffusion-based EVC framework to generate precise speech of the target emotion. Traditional approaches control the intensity of an emotional state in the utterance via emotion class probabilities or intensity labels, which often leads to poor style manipulation and degraded quality. In contrast, we regulate emotion intensity using self-supervised learning-based feature representations and unsupervised directional latent vector modeling (DVM) in the emotional embedding space. The emotion embeddings are modified according to the given target emotion intensity and the corresponding direction vector. The updated embeddings are then fused into the reverse diffusion process to generate speech with the desired emotion and intensity. In summary, this paper aims to achieve high-quality emotional intensity regularization in a diffusion-based EVC framework, which is the first work of its kind. The effectiveness of the proposed method is demonstrated against state-of-the-art (SOTA) baselines in subjective and objective evaluations for English and Hindi. Demo samples are available at https://nirmesh-sony.github.io/EmoReg/.
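
As a rough illustration of the embedding update described in the abstract, the sketch below shifts an emotion embedding along a direction vector, scaled by an intensity value. It is a simplified sketch and not the authors' implementation: it assumes plain per-emotion means in place of the GMM-fitted local means, uses the normalized mean difference in place of the PCA-derived direction, and omits the diffusion model entirely. All function names and toy values are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def direction_vector(source_embs, target_embs):
    """Unit vector pointing from the source-emotion mean to the target-emotion mean.

    Assumption: plain means stand in for GMM local means, and the normalized
    difference stands in for the PCA-derived direction.
    """
    src = mean_vector(source_embs)
    tgt = mean_vector(target_embs)
    diff = [t - s for s, t in zip(src, tgt)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("source and target means coincide; no direction defined")
    return [d / norm for d in diff]


def shift_embedding(emb, direction, intensity):
    """Move an embedding along the direction, scaled by the intensity value."""
    return [e + intensity * d for e, d in zip(emb, direction)]


# Toy 2-D embeddings (hypothetical values, for illustration only).
neutral = [[0.0, 0.0], [0.0, 2.0]]
happy = [[3.0, 5.0], [3.0, 5.0]]

d = direction_vector(neutral, happy)
for alpha in (0.0, 0.5, 1.0):
    # Larger alpha moves the embedding further toward the target emotion.
    print(alpha, shift_embedding([0.0, 1.0], d, alpha))
```

In this toy setup, an intensity of 0 leaves the embedding unchanged, and larger values move it further along the source-to-target direction. In the paper's framework, the shifted embedding would then condition the reverse diffusion process.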

Paper Structure

This paper contains 22 sections, 3 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Conceptual representation of emotional intensity regularization based on direction vector and intensity value.
  • Figure 2: Three key steps of the proposed DVM approach. 1) Fitting a local GMM to each emotional state. 2) Computing directional vectors for all possible transitions from the local mean of one emotional state to another. 3) Applying PCA to find the relevant direction for each emotional transition.
  • Figure 3: Block diagram of the proposed DVM-based Emotion Intensity Regularized EVC architecture. Dotted arrows represent operations performed only during training. The GT $\bar{X}$ is derived by replacing each phoneme's Mel-spectrogram feature in the input with its corresponding pre-calculated average feature.
  • Figure 4: Analysis of emotion similarity score with respect to incremental emotion intensity scale.
  • Figure 5: MUSHRA-based MOS scores for speech quality for the proposed EmoReg approach and baseline methods.
  • ...and 1 more figure