Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder

Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, Dorien Herremans

TL;DR

A novel and efficient framework for accented Text-to-Speech synthesis, based on a Conditional Variational Autoencoder, that can synthesize a selected speaker's voice and convert it to any desired target accent.

Abstract

Accent plays a significant role in speech communication, influencing how well a listener understands speech as well as conveying the speaker's identity. This paper introduces a novel and efficient framework for accented Text-to-Speech (TTS) synthesis based on a Conditional Variational Autoencoder. The framework can synthesize a selected speaker's voice and convert it to any desired target accent. Our thorough experiments validate the effectiveness of the proposed framework using both objective and subjective evaluations. The results also show remarkable performance in terms of the model's ability to manipulate accents in the synthesized speech. Overall, our proposed framework presents a promising avenue for future accented TTS research.

Paper Structure

This paper contains 11 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: An illustration of the training phase and overall architecture, which includes Tacotron2 with a CVAE encoder.
  • Figure 2: Posterior Encoder architecture based on CVAE.
  • Figure 3: A t-SNE projection of the CVAE-NL embeddings. Each colour represents a different accent, whereas each shape represents a different speaker.
  • Figure 4: Subjective evaluation results.