Pronunciation Editing for Finnish Speech using Phonetic Posteriorgrams

Zirui Li, Lauri Juvela, Mikko Kurimo

TL;DR

PPG2Speech addresses the challenge of generating L2-like Finnish pronunciation by editing native speech through Phonetic Posteriorgrams within a diffusion-based framework. It extends the Matcha-TTS flow-matching decoder with Classifier-free Guidance and Sway Sampling to enable single-phoneme pronunciation edits without text alignment, and introduces Phonetic Aligned Consistency (PAC) as an objective editing metric. Experiments on Finnish data show improved generalization to unseen speakers and competitive naturalness, with PPG-based editing achieving meaningful editing fidelity according to PAC and subjective MOS, albeit with some timbre leakage. The work provides a practical pathway for L2 pronunciation training in low-resource languages and releases code for reproducibility.

Abstract

Synthesized second-language (L2) speech is potentially valuable for L2 learning and feedback. However, because L2 speech synthesis datasets are scarce, synthesizing L2 speech is difficult for low-resource languages. In this paper, we provide a practical solution that edits native speech to approximate L2 speech, and we present PPG2Speech, a diffusion-based multispeaker Phonetic-Posteriorgrams-to-Speech model that can edit a single phoneme without text alignment. We use Matcha-TTS's flow-matching decoder as the backbone, transforming Phonetic Posteriorgrams (PPGs) into mel-spectrograms conditioned on external speaker embeddings and pitch. PPG2Speech strengthens this decoder with Classifier-free Guidance (CFG) and Sway Sampling. We also propose a new task-specific objective evaluation metric, Phonetic Aligned Consistency (PAC), which measures editing effects by comparing the edited PPGs with the PPGs extracted from the synthetic speech. We validate the effectiveness of our method on Finnish, a low-resource, nearly phonetic language, using approximately 60 hours of data. We conduct objective and subjective evaluations to compare the naturalness, speaker similarity, and editing effectiveness of our approach with TTS-based editing. Our source code is published at https://github.com/aalto-speech/PPG2Speech.
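To make the two sampling additions named in the abstract more concrete, the following is a minimal sketch of how Classifier-free Guidance and Sway Sampling are commonly combined in an Euler solver for a flow-matching decoder. It is not taken from the PPG2Speech code. The guidance form `v_cond + w * (v_cond - v_uncond)`, the sway function `t = u + s * (cos(pi/2 * u) - 1 + u)`, the default values, and every function name below are assumptions chosen for illustration; the paper's exact formulation may differ.

```python
import numpy as np


def sway_schedule(num_steps, s=-1.0):
    """Warp a uniform time grid on [0, 1] (assumed sway-sampling form).

    With s < 0 the steps are concentrated near t = 0, i.e. the early,
    noisy part of the trajectory. The endpoints 0 and 1 are preserved.
    """
    u = np.linspace(0.0, 1.0, num_steps + 1)
    return u + s * (np.cos(np.pi / 2.0 * u) - 1.0 + u)


def cfg_velocity(v_cond, v_uncond, guidance_scale):
    """Combine conditional and unconditional velocities (assumed CFG form)."""
    return v_cond + guidance_scale * (v_cond - v_uncond)


def euler_sample(velocity_fn, x0, cond, num_steps=10, guidance_scale=2.0, s=-1.0):
    """Integrate dx/dt = v(x, t, cond) from t = 0 (noise) to t = 1 (mel).

    velocity_fn is a hypothetical stand-in for the decoder: it takes
    (x, t, cond) and returns a velocity; cond=None means the conditioning
    has been dropped, as in the unconditional branch of CFG.
    """
    ts = sway_schedule(num_steps, s)
    x = np.asarray(x0, dtype=float)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        v_c = velocity_fn(x, t0, cond)
        v_u = velocity_fn(x, t0, None)
        x = x + (t1 - t0) * cfg_velocity(v_c, v_u, guidance_scale)
    return x


def toy_velocity(x, t, cond):
    """Toy velocity field used only to exercise the sampler."""
    return np.ones_like(x) if cond is not None else np.zeros_like(x)


if __name__ == "__main__":
    mel = euler_sample(toy_velocity, np.zeros((80, 4)), cond="ppg", num_steps=8)
    print(mel.shape)
```

In a real system `velocity_fn` would be the trained decoder, and `cond` would bundle the PPGs, the speaker embedding, and the pitch features. Setting `guidance_scale=0` recovers plain conditional sampling, and `s=0` recovers a uniform time grid.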

Paper Structure

This paper contains 16 sections, 7 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: The diagram of our model. $x_t$ is the noisy mel-spectrogram. $s$ is the speaker embedding, $p+V/UV$ is the pitch embedding sequence concatenated with the voiced/unvoiced flag, and $t \in [0, 1]$ is the diffusion time step.