Accent-VITS: accent transfer for end-to-end TTS

Linhan Ma; Yongmao Zhang; Xinfa Zhu; Yi Lei; Ziqian Ning; Pengcheng Zhu; Lei Xie

TL;DR

Accent-VITS addresses cross-speaker accent transfer by disentangling speaker timbre from accent with a hierarchical CVAE inside an end-to-end TTS framework. It adds a Pronunciation Encoder and a bottleneck (BN) Constraint Module to model accent pronunciation ($z_{pr}$) and acoustic features ($z_{ac}$) separately, with text-to-accent and accent-to-wave stages conditioned on speaker identity and accented content. The model uses a prior encoder with a flow to enrich the distribution of the latent $z_{ac}$ and a posterior encoder operating on mel inputs, paired with a HiFi-GAN decoder trained with GAN losses, plus a duration predictor. Experiments on Mandarin data with four accents show Accent-VITS achieving higher speaker similarity, accent similarity, and naturalness than strong baselines, while ablations confirm the critical role of the BN-guided hierarchical CVAE. The approach improves the robustness and stability of accent transfer in end-to-end TTS, which matters for accent-diverse synthesis.

Abstract

Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker's voice. The main challenge is to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based end-to-end accent transfer model named Accent-VITS. Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, respectively, using bottleneck features and mel spectrograms as constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity, and speech naturalness than a strong baseline.
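
To make the hierarchical CVAE idea concrete, here is a minimal sketch of how two stacked latent levels each contribute a KL term. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy numbers, the pairing of each level with its constraint (bottleneck features for $z_{pr}$, mel for $z_{ac}$), and the identity mapping from $z_{pr}$ to the $z_{ac}$ prior mean are all hypothetical. The flow on the prior is omitted, so the closed-form Gaussian KL used here is a simplification.

```python
import math
import random


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians.

    Each argument is a list of per-dimension means or log-variances;
    the result is summed over dimensions.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def sample(mu, logvar, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


rng = random.Random(0)
dim = 4

# Level 1 (assumed): pronunciation latent z_pr.
# Toy (mean, log-variance) pairs standing in for network outputs on one frame.
pr_prior = ([0.0] * dim, [0.0] * dim)    # text side
pr_post = ([0.3] * dim, [-1.0] * dim)    # constraint side (assumed: bottleneck features)
z_pr = sample(*pr_post, rng)

# Level 2 (assumed): acoustic latent z_ac, whose prior depends on z_pr.
# A real model would use a network (and a flow); here the prior mean is just z_pr.
ac_prior = (z_pr, [0.0] * dim)
ac_post = ([0.1] * dim, [-0.5] * dim)    # constraint side (assumed: mel spectrogram)

kl_pr = gaussian_kl(*pr_post, *pr_prior)
kl_ac = gaussian_kl(*ac_post, *ac_prior)
total_kl = kl_pr + kl_ac

print(f"KL(z_pr) = {kl_pr:.4f}, KL(z_ac) = {kl_ac:.4f}, total = {total_kl:.4f}")
```

The point of the sketch is only the structure: the second-level prior is built from a sample of the first-level latent, which mirrors the text-to-accent and accent-to-wave decomposition described in the abstract.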

Paper Structure (21 sections, 10 equations, 1 figure, 2 tables)

Introduction
Method
Pronunciation Encoder
BN Constraint Module
Prior Encoder
Posterior Encoder
Decoder
Duration Predictor
Final Loss
Experiments
Datasets
Model Configuration
Subjective Evaluation
Speaker Similarity.
Speech Naturalness.
...and 6 more sections

Figures (1)

Figure 1: Overview of Accent-VITS structure.
