High-Fidelity Neural Phonetic Posteriorgrams
Cameron Churchwell, Max Morrison, Bryan Pardo
TL;DR
This work presents an interpretable phonetic posteriorgram (PPG) representation that disentangles pronunciation from speaker identity, and validates it by training a VITS-based speech synthesizer on the representation. It introduces ΔPPG, an acoustic pronunciation distance based on Jensen-Shannon (JS) divergence and grounded in a learned phoneme similarity matrix, and demonstrates that interpretable PPGs enable fine-grained pronunciation control, including interpolation and regex-based accent editing. Across multiple audio representations and standard datasets, the proposed PPG framework achieves competitive phoneme accuracy and robust pitch disentanglement, with EnCodec-based PPGs achieving strong subjective synthesis quality. Together, the interpretable representation, the quantitative pronunciation distance, and the controllable editing offer a practical path toward accent conversion and related linguistic applications, complemented by the open-source ppgs toolkit.
Abstract
A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to their ability to disentangle pronunciation features from speaker identity, allowing accurate reconstruction of pronunciation (e.g., voice conversion) and coarse-grained pronunciation editing (e.g., foreign accent conversion). In this paper, we demonstrably improve the quality of PPGs to produce a state-of-the-art interpretable PPG representation. We train an off-the-shelf speech synthesizer using our PPG representation and show that high-quality PPGs yield independent control over pitch and pronunciation. We further demonstrate novel uses of PPGs, such as an acoustic pronunciation distance and fine-grained pronunciation control.
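To make the definition above concrete, the following is a minimal sketch of a PPG as a matrix of per-frame categorical distributions, together with a plain framewise Jensen-Shannon divergence between two time-aligned PPGs. It is an illustration under stated assumptions, not the paper's ΔPPG: the function names, the phoneme count of 40, and the random logits are all hypothetical, and the sketch omits the learned phoneme similarity matrix.

```python
import numpy as np


def logits_to_ppg(logits):
    """Convert (phonemes, frames) logits to a PPG via a per-frame softmax."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def framewise_js_divergence(p, q, eps=1e-10):
    """Jensen-Shannon divergence (in bits) between two time-aligned PPGs.

    p, q: arrays of shape (phonemes, frames) whose columns sum to 1.
    Returns an array of shape (frames,) with values in [0, 1].
    NOTE: this is the unweighted JS divergence; it is an assumption-laden
    stand-in and does not use a phoneme similarity matrix.
    """
    p = np.clip(p, eps, 1.0)  # avoid log(0)
    q = np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log2(p / m), axis=0)
    kl_qm = np.sum(q * np.log2(q / m), axis=0)
    return 0.5 * (kl_pm + kl_qm)


# Hypothetical example: 40 phoneme classes, 100 frames, random logits
rng = np.random.default_rng(0)
ppg_a = logits_to_ppg(rng.normal(size=(40, 100)))
ppg_b = logits_to_ppg(rng.normal(size=(40, 100)))

# One scalar distance per utterance pair: the mean over frames
distance = float(framewise_js_divergence(ppg_a, ppg_b).mean())
```

Each column of a PPG is a probability distribution over phonemes, which is what makes the representation interpretable: a frame can be read directly as "mostly /aa/, partly /ae/". A distance built from a divergence between such distributions is 0 when two utterances have identical framewise pronunciation and grows as the distributions disagree.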
