Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training

Jan Melechovsky; Ambuj Mehrish; Berrak Sisman; Dorien Herremans

TL;DR

The paper addresses accented text-to-speech (TTS) and introduces a Tacotron2-based system that combines a Multi-Level VAE (MLVAE) with adversarial training (MLVAE-ADV) to disentangle speaker and accent representations. By training adversarially against an accent classifier and optimizing a group ELBO objective, the model improves accent conversion relative to the GST and original MLVAE baselines. Objective metrics show better mel-spectrogram reconstruction, while subjective listening tests reveal trade-offs in speaker identity preservation and voice quality. The work highlights the potential for more inclusive speech synthesis and suggests future work on larger, balanced datasets and on tuning the model to balance accent conversion against speaker consistency.

Abstract

With rapid globalization, the need to build inclusive and representative speech technology cannot be overstated. Accent is an important aspect of speech that needs to be taken into consideration when building inclusive speech synthesizers. Inclusive speech technology aims to erase any biases towards specific groups, such as people with a certain accent. We note that state-of-the-art Text-to-Speech (TTS) systems may currently not be suitable for all people, regardless of their background, as they are designed to generate high-quality voices without focusing on accent. In this paper, we propose a TTS model that utilizes a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion in TTS, with a vision for more inclusive systems in the future. We evaluate the performance through both objective metrics and subjective listening tests. The results show an improvement in accent conversion ability compared to the baseline.

Paper Structure (10 sections, 4 equations, 4 figures, 1 table, 1 algorithm)

Introduction
Related Work
Proposed Method
MLVAE Encoder
Experiments and Results
Dataset and Baselines
Experimental Setup and Inference
Objective Evaluation
Subjective Evaluation
Conclusion

Figures (4)

Figure 1: The proposed model architecture with D-Step (discriminator) and G-Step (generator) illustrations.
Figure 2: Detailed view of the MLVAE encoder; $R_o$ is the output of the Reference Encoder.
Figure 3: A t-SNE projection of speaker and accent embeddings from the MLVAE-ADV model.
Figure 4: XAB accent and speaker similarity test results.
