Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder

Jan Melechovsky; Ambuj Mehrish; Berrak Sisman; Dorien Herremans

TL;DR

This paper presents a novel and efficient framework for accented Text-to-Speech synthesis, based on a Conditional Variational Autoencoder, that can synthesize a selected speaker's voice and convert it to any desired target accent.

Abstract

Accent plays a significant role in speech communication, influencing how well speech is understood as well as conveying a person's identity. This paper introduces a novel and efficient framework for accented Text-to-Speech (TTS) synthesis based on a Conditional Variational Autoencoder (CVAE). The framework can synthesize a selected speaker's voice and convert it to any desired target accent. Experiments with both objective and subjective evaluations validate the effectiveness of the proposed framework. The results also show that the model performs well at manipulating accent in the synthesized speech. Overall, the proposed framework offers a promising avenue for future accented TTS research.

Paper Structure (11 sections, 2 equations, 4 figures, 1 table)

Introduction
Related Work
Proposed Method
Experiments
Dataset
Training and Inference
Accent and Speaker Modelling Analysis
Objective Evaluation
Subjective Evaluation
Discussion on accent-identity balance
Conclusion

Figures (4)

Figure 1: An illustration of the training phase and overall architecture, which includes Tacotron2 with a CVAE encoder.
Figure 2: Posterior Encoder architecture based on CVAE.
Figure 3: A t-SNE projection of the CVAE-NL embeddings. Each colour represents a different accent, whereas each shape represents a different speaker.
Figure 4: Subjective evaluation results.
