
Convert and Speak: Zero-shot Accent Conversion with Minimum Supervision

Zhijun Jia, Huaying Xue, Xiulian Peng, Yan Lu

TL;DR

This work tackles zero-shot accent conversion under minimal supervision by splitting the task into semantic-token conversion and speech synthesis. Using semantic tokens as a bridge lets the conversion module train on limited parallel data, while a target-accent speech generator trained on large corpora produces natural, prosodically appropriate output conditioned on a brief style prompt. Key contributions include a pre-trained semantic-conversion module with a BART/T5-like objective, a single-stage TF-Codec-based autoregressive speech generator, and experiments showing state-of-the-art accent similarity, speech quality, and speaker maintenance with as little as 15 minutes of weakly parallel data. The approach adapts well to other accents, and the single-stage generator lowers complexity and latency, which makes the framework suited to low-resource scenarios.

Abstract

Scarcity of parallel data is the key challenge in accent conversion (AC), in which both the pronunciation units and the prosody pattern need to be converted. We propose a two-stage generative framework, "convert-and-speak", in which conversion operates only at the semantic-token level and speech is synthesized, conditioned on the converted semantic tokens, by a speech generative model in the target-accent domain. This decoupled design lets the "speaking" module use massive amounts of target-accent speech and reduces the parallel data required by the "conversion" module. Using semantic tokens as the bridge also relaxes the requirement for data with text transcriptions and unlocks language pre-training techniques that further reduce the need for parallel accent speech data. To reduce the complexity and latency of "speaking", a single-stage AR generative model is designed that achieves good quality at lower computation cost. Experiments on Indian-English to general American-English conversion show that the proposed framework achieves state-of-the-art performance in accent similarity, speech quality, and speaker maintenance with only 15 minutes of weakly parallel data, which is not constrained to the same speaker. Experiments with diverse accent types suggest that the framework is highly adaptable and can readily be scaled to other accents with low-resource data. Audio samples are available at https://www.microsoft.com/en-us/research/project/convert-and-speak-zero-shot-accent-conversion-with-minimumsupervision/.
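
The abstract mentions language pre-training on semantic tokens, and the TL;DR describes the objective as BART/T5-like. As a rough illustration of what such a denoising setup could look like, the sketch below corrupts a semantic-token sequence by collapsing random spans into a single mask token, so that a sequence-to-sequence model can be trained to reconstruct the original. The function name `corrupt_tokens`, the `MASK_ID` value, the mask ratio, and the span length are all assumptions made for this example; this section does not specify the corruption scheme actually used.

```python
import random
from typing import List, Tuple

MASK_ID = -1  # hypothetical id reserved for the mask token (assumption)


def corrupt_tokens(
    tokens: List[int],
    mask_ratio: float = 0.3,   # illustrative value, not taken from the paper
    span_len: int = 3,         # illustrative value, not taken from the paper
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Infilling-style noise: each selected span collapses to one MASK_ID.

    Returns (corrupted_input, reconstruction_target).
    """
    rng = random.Random(seed)
    budget = int(len(tokens) * mask_ratio)  # max number of tokens to hide
    corrupted: List[int] = []
    hidden = 0
    i = 0
    while i < len(tokens):
        if hidden < budget and rng.random() < mask_ratio:
            # Hide a span, clipped to the sequence end and remaining budget.
            span = min(span_len, len(tokens) - i, budget - hidden)
            corrupted.append(MASK_ID)
            i += span
            hidden += span
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, list(tokens)


if __name__ == "__main__":
    semantic_tokens = list(range(100, 120))  # stand-in for real token ids
    noisy, target = corrupt_tokens(semantic_tokens, seed=1)
    print("input :", noisy)
    print("target:", target)
```

Because this kind of pre-training needs only unpaired token sequences, it is consistent with the abstract's claim that pre-training reduces the amount of parallel accent data required; the parallel data would then be used only for fine-tuning the source-to-target mapping.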

Paper Structure (21 sections, 3 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Proposed framework. In the first stage, the source-accent semantic tokens are converted to target-accent semantic tokens; in the second stage, speech is generated with target-accent prosody, conditioned on the converted semantic tokens. The style prompt is extracted from the first 3 seconds of the source speech. Each TF-Codec token is a group of concatenated embeddings, one from each quantizer.
  • Figure 2: Accent classification results for the VCTK test set, evaluated by CommonAccent.
  • Figure 3: Accent classification results for the L1-L2 ARCTIC test set, evaluated by CommonAccent.
  • Figure 4: An example of pitch contour and phoneme improvement. (Content: "It's also very valuable.")
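
The data flow described in the Figure 1 caption can be sketched as a small orchestration function. This is a minimal sketch under stated assumptions, not the authors' implementation: `tokenize`, `convert`, and `speak` are hypothetical callables standing in for the semantic tokenizer, the conversion module, and the TF-Codec-based generator, and the style prompt is represented here simply as the first 3 seconds of raw samples, whereas the actual system presumably encodes that segment first.

```python
from typing import Callable, List, Sequence

PROMPT_SECONDS = 3.0  # "first 3 seconds of the source speech" (Figure 1)


def extract_style_prompt(
    waveform: Sequence[float],
    sample_rate: int,
    seconds: float = PROMPT_SECONDS,
) -> List[float]:
    """Return the leading `seconds` of audio (or the whole clip if shorter)."""
    n_samples = int(sample_rate * seconds)
    return list(waveform[:n_samples])


def convert_and_speak(
    source_wave: Sequence[float],
    sample_rate: int,
    tokenize: Callable[[Sequence[float]], List[int]],           # hypothetical
    convert: Callable[[List[int]], List[int]],                  # hypothetical
    speak: Callable[[List[int], List[float]], List[float]],     # hypothetical
) -> List[float]:
    """Two-stage flow: convert semantic tokens, then synthesize speech."""
    source_tokens = tokenize(source_wave)        # source-accent semantic tokens
    target_tokens = convert(source_tokens)       # stage 1: token conversion
    prompt = extract_style_prompt(source_wave, sample_rate)
    return speak(target_tokens, prompt)          # stage 2: generation


if __name__ == "__main__":
    # Stub components so the sketch runs end to end without any models.
    wave = [0.0] * (16000 * 5)                   # 5 s of silence at 16 kHz
    out = convert_and_speak(
        wave,
        16000,
        tokenize=lambda w: [1, 2, 3],
        convert=lambda toks: [t + 10 for t in toks],
        speak=lambda toks, prompt: [float(t) for t in toks],
    )
    print(out)
```

Keeping the three components behind plain callables mirrors the decoupling the paper emphasizes: the conversion module and the speech generator can be trained on different data and swapped independently.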