Accent-VITS: accent transfer for end-to-end TTS
Linhan Ma, Yongmao Zhang, Xinfa Zhu, Yi Lei, Ziqian Ning, Pengcheng Zhu, Lei Xie
TL;DR
Accent-VITS addresses cross-speaker accent transfer by disentangling speaker timbre from accent using a hierarchical CVAE within an end-to-end TTS framework. It adds a Pronunciation Encoder and a bottleneck (BN) Constraint Module so that accent pronunciation ($z_{pr}$) and acoustic features ($z_{ac}$) are modeled separately, and it splits synthesis into a text-to-accent stage and an accent-to-wave stage conditioned on speaker identity and accented content. The model uses a prior encoder with a flow to enrich the distribution of the latent $z_{ac}$ and a posterior encoder that operates on mel inputs, paired with a HiFi-GAN decoder trained with GAN losses and a duration predictor. Experiments on Mandarin data with four accents show that Accent-VITS achieves higher speaker similarity, accent similarity, and naturalness than strong baselines, and ablations confirm the critical role of the BN-guided hierarchical CVAE. The approach makes accent transfer in end-to-end TTS more robust and stable, which is of practical value for multilingual, accent-diverse synthesis.
Abstract
Accent transfer aims to transfer an accent from a source speaker to synthetic speech in the target speaker's voice. The main challenge is to effectively disentangle speaker timbre and accent, which are entangled in speech. This paper presents a VITS-based end-to-end accent transfer model named Accent-VITS. Based on the main structure of VITS, Accent-VITS makes substantial improvements to enable effective and stable accent transfer. We leverage a hierarchical CVAE structure to model accent pronunciation information and acoustic features, using bottleneck features and mel spectrograms as the respective constraints. Moreover, the text-to-wave mapping in VITS is decomposed into text-to-accent and accent-to-wave mappings in Accent-VITS. In this way, the disentanglement of accent and speaker timbre becomes more stable and effective. Experiments on multi-accent and Mandarin datasets show that Accent-VITS achieves higher speaker similarity, accent similarity and speech naturalness than a strong baseline.
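To make the hierarchical CVAE idea more concrete, the following is a minimal sketch of how the two latent levels could each contribute a KL regularization term. It is an illustration under stated assumptions, not the authors' implementation: all distributions are assumed to be diagonal Gaussians, the flow on the $z_{ac}$ prior, the GAN losses, and the duration predictor are omitted, and the names `gaussian_kl` and `hierarchical_kl` are hypothetical. The pairing assumed here is that the $z_{pr}$ posterior comes from bottleneck features and its prior from text, while the $z_{ac}$ posterior comes from the mel spectrogram and its prior from the accent-level latent plus speaker identity.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for two diagonal Gaussians given as equal-length lists."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Closed-form KL per dimension, using log-variances for stability.
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def hierarchical_kl(pr_post, pr_prior, ac_post, ac_prior):
    """Sum the KL terms of both latent levels (hypothetical, unweighted).

    Each argument is a (mu, logvar) pair:
      pr_post  - assumed posterior of z_pr (from bottleneck features)
      pr_prior - assumed prior of z_pr (from text)
      ac_post  - assumed posterior of z_ac (from the mel spectrogram)
      ac_prior - assumed prior of z_ac (from z_pr and speaker identity)
    """
    kl_pr = gaussian_kl(*pr_post, *pr_prior)
    kl_ac = gaussian_kl(*ac_post, *ac_prior)
    return kl_pr + kl_ac
```

In this sketch each level is pulled toward its own prior, which reflects the decomposition into text-to-accent and accent-to-wave mappings; how the actual model weights or modifies these terms is not specified in the text above.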
