A Pilot Study of GSLM-based Simulation of Foreign Accentuation Only Using Native Speech Corpora
Kentaro Onda, Joonyong Park, Nobuaki Minematsu, Daisuke Saito
TL;DR
This work introduces a GSLM-based method that simulates foreign accentuation using only native speech corpora: input speech in language A is converted into a unit sequence of language B (S2u) and then resynthesized (u2S), which imprints B's accent on the speech. The approach uses HuBERT-based representations and k-means clustering to produce discrete units, and a single-speaker Tacotron2 for resynthesis, treating the unit sequence as linguistic text. Experiments across EN, JPN, CHN, SPN, and FRC show that using more units reduces the degree of accent (lower WER/PER) while preserving characteristic L1-L2 substitution patterns, particularly for phonemes such as [θ], which indicates faithful reproduction of substitution tendencies. The results suggest that phonemic accentuation can be synthesized controllably from native corpora, offering a path toward more natural and varied accented speech without requiring non-native speech data; duration-based accent features, however, remain an open challenge for future work.
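The S2u step can be illustrated with a minimal sketch, assuming frame-level features have already been extracted (for example by a HuBERT model) and that k-means centroids have been trained on native speech of language B. The function name `speech_to_units`, the toy 2-D features, and the optional collapsing of repeated units are illustrative assumptions, not details confirmed by the paper.

```python
from typing import List

import numpy as np


def speech_to_units(features: np.ndarray, centroids: np.ndarray,
                    dedup: bool = True) -> List[int]:
    """Assign each feature frame to its nearest centroid (hypothetical helper).

    features:  (num_frames, dim) frame-level representations of language-A speech
    centroids: (num_units, dim) k-means centroids trained on language-B speech
    dedup:     collapse consecutive repeats (an assumption, common in GSLM setups)
    """
    # Squared Euclidean distance between every frame and every centroid.
    diffs = features[:, None, :] - centroids[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)          # shape: (num_frames, num_units)
    units = dists.argmin(axis=1).tolist()      # nearest language-B unit per frame
    if dedup:
        units = [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
    return units


# Toy data: three language-B units and four language-A frames.
centroids_b = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
frames_a = np.array([[0.1, -0.1], [0.2, 0.0], [2.4, 2.4], [3.9, 4.2]])

# The third frame lies between B's units and is snapped to the closest one,
# loosely analogous to a foreign sound being replaced by an L1 category.
print(speech_to_units(frames_a, centroids_b))               # [0, 1, 2]
print(speech_to_units(frames_a, centroids_b, dedup=False))  # [0, 0, 1, 2]
```

With more centroids, each frame has a closer unit available, which is one intuition for why a larger unit inventory yields a weaker accent.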
Abstract
We propose a method of simulating the human process of foreign accentuation using a Generative Spoken Language Model (GSLM) trained only on native speech corpora. When one listens to spoken words of a foreign language and repeats them, the repeated speech often carries the accent of the listener's L1. This is said to be because the spoken words are mentally represented as a sequence of phonological units of the L1, and those units are used for oral reproduction. We simulate this process by inputting speech of language A into a GSLM of language B, which adds B's accent to the input speech. Running L1 ASR on foreign input speech and passing the ASR result to L1 TTS can be viewed as a naive implementation of this approach. Our experimental results show that the synthesized accent of the output speech is highly natural when compared with real samples of A produced by speakers whose L1 is B, and that the degree of accentuation is controllable.
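One way to quantify the degree of accentuation is a phoneme error rate (PER) between the intended phoneme sequence and the sequence recognized from the output speech. The sketch below is a generic Levenshtein-based PER, not the authors' evaluation code; the function names and the [θ] to [s] example are illustrative assumptions and are not results from the paper.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over phoneme tokens (substitution, insertion, deletion)."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference phoneme
                cur[j - 1] + 1,          # insertion of an extra phoneme
                prev[j - 1] + (r != h),  # substitution (free if phonemes match)
            ))
        prev = cur
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """PER = edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative example: "think" with [θ] replaced by [s].
ref = ["θ", "ɪ", "ŋ", "k"]
hyp = ["s", "ɪ", "ŋ", "k"]
print(phoneme_error_rate(ref, hyp))  # 0.25
```

Under this measure, a lower PER corresponds to output closer to the native target, that is, a weaker accent.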
