Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS

Tuan Nam Nguyen, Seymanur Akti, Ngoc Quan Pham, Alexander Waibel

TL;DR

This work tackles pronunciation issues in non-native speech within accent conversion (AC) by introducing a reference-free, non-autoregressive AC framework based on the VITS architecture. It employs a two-stage training regime: first pretraining a native VITS TTS and AC, then generating ideal native-pronunciation ground-truth for non-native inputs via MAS-aligned transcripts and synthetic synthesis, followed by finetuning with knowledge distillation from native TTS. Key innovations include sharing VITS components, generating ground-truth that preserves duration and speaker identity, and distilling accent-independent representations through a KL divergence between priors $p_{\theta_{audio}}(z|c_{audio})$ and $p_{\theta_{text}}(z|c_{text})$. Empirical results show improved WER and nativeness, with stable speaker identity (SECS around $0.82$-$0.84$) and nuanced ACC performance due to prosody retention, indicating practical benefits for pronunciation correction and accent conversion in multi-speaker settings. This approach has potential applications in dubbing, language learning, and real-time communication where intelligibility and natural-sounding pronunciation are critical.
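To make the distillation term concrete, here is a minimal sketch of a KL divergence between two diagonal Gaussian priors, the form VITS-style priors usually take. The summary does not state the direction of the KL or how the priors are parameterized, so the argument order (audio prior first, text prior second), the log-standard-deviation parameterization, and the function name `kl_diag_gaussians` are all assumptions made for illustration.

```python
import math


def kl_diag_gaussians(mu_p, log_sigma_p, mu_q, log_sigma_q):
    """KL(p || q) between diagonal Gaussians, summed over latent dimensions.

    Assumption: p plays the role of the audio-conditioned prior and q the
    text-conditioned prior. The source does not specify the direction.
    Each argument is a flat list of per-dimension values.
    """
    total = 0.0
    for mp, lsp, mq, lsq in zip(mu_p, log_sigma_p, mu_q, log_sigma_q):
        var_p = math.exp(2.0 * lsp)
        var_q = math.exp(2.0 * lsq)
        # Closed form for two univariate Gaussians:
        # log(sigma_q / sigma_p) + (sigma_p^2 + (mu_p - mu_q)^2) / (2 sigma_q^2) - 1/2
        total += (lsq - lsp) + (var_p + (mp - mq) ** 2) / (2.0 * var_q) - 0.5
    return total


# Hypothetical two-dimensional latent: identical priors give zero divergence.
same = kl_diag_gaussians([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])

# Unit variances with means one apart in a single dimension.
shifted = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])
```

In a training loop this quantity would be averaged over frames and added to the finetuning loss; minimizing it pulls the audio-conditioned prior toward the text-conditioned one, which is one way to read "accent-independent representations" in the summary above.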

Abstract

Previous approaches to accent conversion (AC) mainly aimed at making non-native speech sound more native while maintaining the original content and speaker identity. However, non-native speakers sometimes have pronunciation issues, which can make it difficult for listeners to understand them. Hence, we developed a new AC approach that not only focuses on accent conversion but also improves the pronunciation of non-native accented speakers. Given the non-native audio and the corresponding transcript, we generate ideal ground-truth audio with native-like pronunciation that keeps the original duration and prosody. This ground-truth data aids the model in learning a direct mapping between accented and native speech. We utilize the end-to-end VITS framework to achieve high-quality waveform reconstruction for the AC task. As a result, our system not only produces audio that closely resembles native accents while retaining the original speaker's identity but also improves pronunciation, as demonstrated by evaluation results.
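Preserving the original duration requires knowing how many audio frames each transcript token spans. The TL;DR names monotonic alignment search (MAS) for this step; the sketch below shows the standard dynamic-programming form of MAS on a small, hypothetical log-likelihood matrix. The function name, the input layout (`log_lik[i][j]` scoring token `i` against frame `j`), and the toy values are assumptions for illustration, not the authors' implementation.

```python
def monotonic_alignment_durations(log_lik):
    """Return per-token frame counts for the best monotonic alignment.

    log_lik[i][j] is a (hypothetical) log-likelihood of frame j under token i.
    The path starts at (0, 0), ends at the last token and last frame, and at
    each frame either stays on the current token or advances by one token.
    """
    n_tokens = len(log_lik)
    n_frames = len(log_lik[0])
    if n_tokens > n_frames:
        raise ValueError("need at least one frame per token")

    neg_inf = float("-inf")
    # q[i][j]: best cumulative score of any valid path ending at (i, j).
    q = [[neg_inf] * n_frames for _ in range(n_tokens)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_tokens)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else neg_inf
            q[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the final cell, counting frames assigned to each token.
    durations = [0] * n_tokens
    i = n_tokens - 1
    for j in range(n_frames - 1, -1, -1):
        durations[i] += 1
        if j > 0 and i > 0 and (i == j or q[i - 1][j - 1] > q[i][j - 1]):
            i -= 1
    return durations


# Toy case: token 0 fits frames 0-2 well, token 1 fits frame 3 well.
toy_scores = [
    [0.0, 0.0, 0.0, -5.0],
    [-5.0, -5.0, -5.0, 0.0],
]
toy_durations = monotonic_alignment_durations(toy_scores)  # [3, 1]
```

Durations obtained this way could then drive a native TTS model so that the synthesized ground-truth has the same frame count per token as the non-native recording, which is the property the abstract describes.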

Paper Structure

This paper contains 16 sections, 11 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Pre-training and fine-tuning procedure of our proposed model. The parameters of blue components are frozen.