Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li

TL;DR

This paper tackles emotional voice conversion, enabling one-to-many and unseen emotion transfer without parallel training data. It introduces DeepEST, a framework built on VAW-GAN whose decoder is conditioned on deep emotional features extracted by a pre-trained speech emotion recognition (SER) model, so that both seen and unseen emotional styles can be transferred to a new utterance. The method follows a three-stage pipeline (emotion descriptor training, encoder-decoder training with VAW-GAN, and run-time conversion) and is evaluated on a new multilingual emotional speech dataset (ESD). Results show that DeepEST surpasses a strong baseline on seen emotions and remains competitive on unseen emotions, in both objective and subjective evaluations. The ESD corpus is also released to the community for broader cross-lingual and expressive speech research.

Abstract

Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on a discrete representation, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles. In this paper, we propose a novel framework based on variational auto-encoding Wasserstein generative adversarial network (VAW-GAN), which makes use of a pre-trained speech emotion recognition (SER) model to transfer emotional style during training and at run-time inference. In this way, the network is able to transfer both seen and unseen emotional styles to a new utterance. We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework. This paper also marks the release of an emotional speech dataset (ESD) for voice conversion, which has multiple speakers and languages.
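
The contrast drawn above, between conditioning a decoder on a one-hot label and conditioning it on a continuous feature from a pre-trained SER model, can be illustrated with a toy sketch. This is not the DeepEST implementation: the dimensions, the random linear layers, the mean-pooling, the use of a reference utterance, and the names `emotion_embedding` and `decode` are all assumptions chosen only to show how a continuous emotion embedding can steer a decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8   # size of the encoder's latent code per frame (assumed)
EMO_DIM = 4      # size of the emotion embedding (assumed)
FEAT_DIM = 16    # size of one spectral frame (assumed)

# Stand-in for a frozen, pre-trained SER model: a fixed random projection.
# A real SER network would be trained on emotion-labelled speech.
W_ser = rng.standard_normal((FEAT_DIM, EMO_DIM))

def emotion_embedding(reference_frames):
    """Toy utterance-level embedding: mean-pool the frames, then project."""
    pooled = reference_frames.mean(axis=0)
    return np.tanh(pooled @ W_ser)

# Toy decoder: a single linear layer over [latent code ; emotion embedding].
W_dec = rng.standard_normal((LATENT_DIM + EMO_DIM, FEAT_DIM))

def decode(z_frames, emo):
    """Condition every frame's latent code on the same emotion embedding."""
    emo_tiled = np.tile(emo, (z_frames.shape[0], 1))
    return np.concatenate([z_frames, emo_tiled], axis=1) @ W_dec

# Same latent content, two different (hypothetical) reference utterances.
z = rng.standard_normal((5, LATENT_DIM))
ref_a = rng.standard_normal((20, FEAT_DIM))
ref_b = rng.standard_normal((30, FEAT_DIM))

out_a = decode(z, emotion_embedding(ref_a))
out_b = decode(z, emotion_embedding(ref_b))
print(out_a.shape, np.allclose(out_a, out_b))
```

Because the embedding is computed from audio rather than looked up from a label table, any utterance can supply a conditioning vector in this sketch, including one whose emotion never appeared in training. A one-hot scheme has no slot for such a style.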

Paper Structure

This paper contains 13 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: t-SNE plot of deep emotional features for 20 utterances with the same content but spoken by different speakers.
  • Figure 2: The training phase of the proposed DeepEST framework. Blue boxes represent the networks that are involved in training, and red boxes represent the networks that are already trained.
  • Figure 3: AB preference test results for the speech quality.
  • Figure 4: XAB preference test results for the emotion similarity.