Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS
Tuan Nam Nguyen, Seymanur Akti, Ngoc Quan Pham, Alexander Waibel
TL;DR
This work tackles pronunciation issues in non-native speech within accent conversion (AC) by introducing a reference-free, non-autoregressive AC framework based on the VITS architecture. It employs a two-stage training regime: first pretraining a native VITS TTS and AC, then generating ideal native-pronunciation ground-truth for non-native inputs via MAS-aligned transcripts and synthetic synthesis, followed by finetuning with knowledge distillation from native TTS. Key innovations include sharing VITS components, generating ground-truth that preserves duration and speaker identity, and distilling accent-independent representations through a KL divergence between priors $p_{\theta_{audio}}(z|c_{audio})$ and $p_{\theta_{text}}(z|c_{text})$. Empirical results show improved WER and nativeness, with stable speaker identity (SECS around $0.82$-$0.84$) and nuanced ACC performance due to prosody retention, indicating practical benefits for pronunciation correction and accent conversion in multi-speaker settings. This approach has potential applications in dubbing, language learning, and real-time communication where intelligibility and natural-sounding pronunciation are critical.
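As a rough illustration of the distillation term mentioned above, the sketch below computes a KL divergence between two diagonal Gaussian priors. It assumes, as in the standard VITS formulation, that each prior is parameterized by a per-dimension mean and log standard deviation. The function name `kl_diag_gaussians`, the argument names, and the direction of the divergence (audio prior against text prior) are illustrative assumptions, not details confirmed by the summary.

```python
import math

def kl_diag_gaussians(mu_p, logs_p, mu_q, logs_q):
    """KL(p || q) between two diagonal Gaussians, summed over dimensions.

    mu_*   : per-dimension means
    logs_* : per-dimension log standard deviations
    Hypothetical helper; here p stands in for the audio-conditioned prior
    and q for the text-conditioned prior (an assumed direction).
    """
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logs_p, mu_q, logs_q):
        var_p = math.exp(2.0 * lp)
        var_q = math.exp(2.0 * lq)
        # Closed form: log(sigma_q / sigma_p)
        #              + (sigma_p^2 + (mu_p - mu_q)^2) / (2 sigma_q^2) - 1/2
        total += (lq - lp) + (var_p + (mp - mq) ** 2) / (2.0 * var_q) - 0.5
    return total

# Toy two-dimensional priors (made-up numbers for illustration only).
audio_prior = ([1.0, 0.0], [0.0, 0.0])
text_prior = ([0.0, 0.0], [0.0, 0.0])
loss = kl_diag_gaussians(*audio_prior, *text_prior)
```

Minimizing such a term pulls the audio-conditioned prior toward the text-conditioned one, which is one way to encourage a representation that does not depend on the input accent.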
Abstract
Previous approaches to accent conversion (AC) mainly aimed at making non-native speech sound more native while maintaining the original content and speaker identity. However, non-native speakers sometimes have pronunciation issues, which can make it difficult for listeners to understand them. Hence, we developed a new AC approach that not only converts the accent but also improves the pronunciation of non-native accented speakers. Given the non-native audio and the corresponding transcript, we generate ideal ground-truth audio that has native-like pronunciation while keeping the original duration and prosody. This ground-truth data aids the model in learning a direct mapping between accented and native speech. We utilize the end-to-end VITS framework to achieve high-quality waveform reconstruction for the AC task. As a result, our system not only produces audio that closely resembles native accents while retaining the original speaker's identity but also improves pronunciation, as demonstrated by evaluation results.
