Prosodically Enhanced Foreign Accent Simulation by Discrete Token-based Resynthesis Only with Native Speech Corpora

Kentaro Onda, Keisuke Imoto, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu

TL;DR

This work extends prior discrete token-based foreign-accent simulation by integrating duration modeling, so that durational accents can be reproduced using only native speech data. The method employs a Generative Spoken Language Model with $S2u$ (speech-to-unit) and $u2S$ (unit-to-speech) modules and discretizes SSL features via $k$-means ($k \in \{50, 200, 1000\}$), converting input speech in language $A$ into a $B$-accented variant without relying on accented data. A two-step duration modification (de-duplication followed by a neural duration predictor inspired by FastSpeech2) adjusts the unit sequence before synthesis, enabling realistic durational patterns such as the isochrony observed in Japanese English. Experimental results show successful replication of durational accents and improved subjective naturalness for Japanese-accented English, at some cost in intelligibility. The approach holds promise for training listeners and ASR systems to be robust to foreign accents without requiring accented data. Overall, the paper demonstrates that duration-aware accent simulation from native speech alone can yield more natural accented prosody and can be applied to accents for which no accented data are available.
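The two-step duration modification summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy unit IDs, and the constant-duration stub `toy_duration_predictor` are all hypothetical, and the stub merely stands in for the neural duration predictor. The sketch only shows the data flow: collapse runs of repeated units, obtain one duration per remaining unit, then re-expand the sequence before it would be passed to the unit-to-speech module.

```python
from itertools import groupby


def deduplicate(units):
    """Collapse runs of identical consecutive units.

    Returns the de-duplicated unit sequence and the original run lengths
    (in frames), so the input can be reconstructed if needed.
    """
    deduped, run_lengths = [], []
    for unit, run in groupby(units):
        deduped.append(unit)
        run_lengths.append(len(list(run)))
    return deduped, run_lengths


def expand(units, durations):
    """Repeat each unit according to its duration in frames.

    Assumption for this sketch: every unit is kept for at least one frame.
    """
    if len(units) != len(durations):
        raise ValueError("units and durations must have the same length")
    expanded = []
    for unit, duration in zip(units, durations):
        expanded.extend([unit] * max(1, int(duration)))
    return expanded


def toy_duration_predictor(units, frames_per_unit=2):
    """Hypothetical stand-in for a neural duration predictor.

    Assigns the same duration to every unit, which crudely mimics an
    evenly timed rhythm; a real predictor would be a trained model.
    """
    return [frames_per_unit] * len(units)


# Toy frame-level unit sequence (made-up cluster IDs).
frame_units = [12, 12, 12, 7, 7, 40, 40, 40, 40, 7]

deduped, original_durations = deduplicate(frame_units)
new_durations = toy_duration_predictor(deduped)
modified_units = expand(deduped, new_durations)
```

Expanding `deduped` with `original_durations` reproduces the input exactly, whereas expanding with the predicted durations changes only the timing and leaves the unit identities untouched.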

Abstract

Recently, a method was proposed for synthesizing foreign-accented speech from native speech data alone, using discrete tokens obtained from self-supervised learning (SSL) models. Given the limited availability of accented speech data, this method is expected to make it much easier to simulate foreign accents. Using the synthesized accented speech as listening material for humans or as training data for automatic speech recognition (ASR) should make both more robust to foreign accents. However, the previous method has a critical flaw: it cannot reproduce duration-related accents. Durational accents are common when L2 speakers whose native language has a syllable-timed or mora-timed rhythm speak stress-timed languages such as English. In this paper, we integrate duration modification into the previous method to simulate foreign accents more accurately. Experiments show that the proposed method successfully replicates durational accents seen in real L2 speech.

Paper Structure

This paper contains 15 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Foreign accent simulation with duration modeling
  • Figure 2: Forced alignment results of: (a) real American English (USA/M02), (b) real Japanese English (IWA/M01), (c) baseline synthesized Japanese English, (d) duration-modified synthesized Japanese English. (c) and (d) are synthesized using (a) as input speech. The script is "A good attitude is unbeatable."
  • Figure 3: Preference results as Japanese-accented English: (a) total response counts, (b) majority votes per sentence.