Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion

Yun Chen, Lingxiao Yang, Qi Chen, Jian-Huang Lai, Xiaohua Xie

TL;DR

This paper proposes an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion and outperforms state-of-the-art methods on both objective and subjective metrics.

Abstract

Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving non-emotion components. Existing approaches struggle to express fine-grained emotional attributes. In this paper, we propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion. We introduce a two-stage pipeline to train our network effectively. Stage I uses inter-speech contrastive learning to model fine-grained emotion and intra-speech disentanglement learning to better separate emotion and content. Stage II regularizes the conversion with a multi-view consistency mechanism, which helps transfer fine-grained emotion while maintaining speech content. Extensive experiments show that AINN outperforms state-of-the-art methods on both objective and subjective metrics.

Paper Structure

This paper contains 15 sections, 12 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Illustration of (a) Domain-level emotional voice conversion, and (b) Instance-level emotional voice conversion.
  • Figure 2: The framework of the proposed method. It consists of two stages: feature disentanglement and conversion adaptation.
  • Figure 3: Illustration of the intra-speech interactive method.
  • Figure 4: The strength accuracy of (a) source speech and (b) converted speech.
  • Figure 5: The architecture of our AINN.
  • ...and 2 more figures