High-Fidelity Neural Phonetic Posteriorgrams

Cameron Churchwell, Max Morrison, Bryan Pardo

TL;DR

This work presents an interpretable phonetic posteriorgram (PPG) representation that disentangles pronunciation from speaker identity and validates its efficacy by training a VITS-based speech synthesizer. It introduces a JS-divergence–based acoustic pronunciation distance, ΔPPG, grounded in a learned phoneme similarity matrix, and demonstrates that interpretable PPGs enable fine-grained pronunciation control including interpolation and regex-based accent editing. Across multiple audio representations and standard datasets, the proposed PPG framework achieves competitive phoneme accuracy and robust pitch disentanglement, with EnCodec-based PPGs achieving strong subjective synthesis quality. The combination of interpretable representation, quantitative pronunciation distance, and controllable pronunciation editing offers a practical pathway for pronunciation editing, accent conversion, and related linguistic applications, complemented by an open-source ppgs toolkit.
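
As a rough illustration of a JS-divergence-based pronunciation distance, the sketch below averages the framewise Jensen-Shannon divergence between two time-aligned PPGs. It is a minimal sketch under stated assumptions, not the paper's ΔPPG definition: the function names, the base-2 logarithm, and the optional `similarity` smoothing step are all hypothetical choices made for illustration.

```python
import numpy as np


def js_divergence(p, q, eps=1e-12):
    """Framewise Jensen-Shannon divergence (base 2) between rows of p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Terms where a == 0 contribute 0 by convention.
        terms = np.where(a > 0, a * np.log2((a + eps) / (b + eps)), 0.0)
        return terms.sum(axis=-1)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def ppg_distance(ppg_a, ppg_b, similarity=None):
    """Mean framewise JS divergence between two PPGs of shape (frames, phonemes).

    similarity: optional (phonemes, phonemes) matrix used to spread probability
    onto acoustically similar phonemes before comparison. This smoothing step
    is an assumption for illustration only.
    """
    a = np.asarray(ppg_a, dtype=float)
    b = np.asarray(ppg_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("PPGs must be time-aligned and have the same shape")
    if similarity is not None:
        s = np.asarray(similarity, dtype=float)
        a = a @ s
        b = b @ s
        # Renormalize so each frame is again a probability distribution.
        a = a / a.sum(axis=-1, keepdims=True)
        b = b / b.sum(axis=-1, keepdims=True)
    return float(np.mean(js_divergence(a, b)))
```

With base-2 logarithms the distance lies in [0, 1]: it is 0 for identical PPGs and 1 when every frame pair has disjoint support. Smoothing with a similarity matrix lowers the distance between phonemes that are easily confused, which is the intuition behind grounding the distance in phoneme similarity.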

Abstract

A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to their ability to disentangle pronunciation features from speaker identity, allowing accurate reconstruction of pronunciation (e.g., voice conversion) and coarse-grained pronunciation editing (e.g., foreign accent conversion). In this paper, we demonstrably improve the quality of PPGs to produce a state-of-the-art interpretable PPG representation. We train an off-the-shelf speech synthesizer using our PPG representation and show that high-quality PPGs yield independent control over pitch and pronunciation. We further demonstrate novel uses of PPGs, such as an acoustic pronunciation distance and fine-grained pronunciation control.
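
Because each PPG frame is a categorical distribution, fine-grained control can be illustrated by interpolating between two frames. The Figure 1 caption states that SLERP is used for this. The sketch below is one possible way to apply SLERP to probability vectors, not necessarily the authors' procedure: it assumes the interpolation is carried out on elementwise square roots (which have unit Euclidean norm) and squared afterwards, and the name `slerp_ppg` is hypothetical.

```python
import numpy as np


def slerp_ppg(ppg_a, ppg_b, weight):
    """Framewise SLERP between two time-aligned PPGs of shape (frames, phonemes).

    Assumption (not from the paper): interpolate sqrt-probabilities, which lie
    on the unit sphere, then square the result to recover a distribution.
    weight = 0 returns ppg_a and weight = 1 returns ppg_b.
    """
    a = np.sqrt(np.asarray(ppg_a, dtype=float))
    b = np.sqrt(np.asarray(ppg_b, dtype=float))

    # Angle between the two unit vectors in each frame.
    cos_omega = np.clip(np.sum(a * b, axis=-1, keepdims=True), -1.0, 1.0)
    omega = np.arccos(cos_omega)
    sin_omega = np.sin(omega)

    # Fall back to linear weights when the frames are (nearly) identical.
    near = sin_omega < 1e-8
    safe = np.where(near, 1.0, sin_omega)
    w_a = np.where(near, 1.0 - weight, np.sin((1.0 - weight) * omega) / safe)
    w_b = np.where(near, weight, np.sin(weight * omega) / safe)

    mixed = (w_a * a + w_b * b) ** 2
    # Guard against rounding error so every frame sums to one.
    return mixed / mixed.sum(axis=-1, keepdims=True)
```

Interpolating halfway between two one-hot frames under this scheme gives equal probability to both phonemes; intermediate weights trace a smooth path between the two pronunciations.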

Paper Structure

This paper contains 12 sections, 1 equation, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Pronunciation interpolation and distance | We train a VITS [vits] speech synthesizer on our interpretable PPGs and use it for (left) voice conversion, (center) pronunciation interpolation, and (right) manual phoneme editing. (top) We overlay PPGs computed from a recording of the word "tomato" (blue) with PPGs inferred from the synthesized speech (red). For readability, phoneme rows with maximum probability $<10\%$ are omitted. The accurate reconstruction of PPGs (magenta, where blue and red overlap) indicates that the (potentially edited) phonetic content is preserved in the generated speech. In the center, the input (blue) PPG is interpolated halfway between the left and right PPGs using SLERP [shoemake1985animating]. Note that interpolating "ey" (left) and "aa" (right) is reconstructed as "ae" or "eh" (center). This is consistent with interpolating vowels in formant space (F1, F2 - F1) [ladefoged2014course] and indicates that one pronunciation can be represented in more than one way in a PPG. (bottom) Pronunciation distances between the synthesized speech and the original audio. Our proposed pronunciation distance (ΔPPG) is more robust to resynthesis artifacts and accurately captures pronunciation interpolation without a transcript.
  • Figure 2: Average framewise phoneme accuracy | Accuracy of PPGs computed from five input representations. The wav2vec 2.0 [wav2vec2] input representation has the best PPG accuracy when averaged over all datasets (see legend). N.B.: the base wav2vec 2.0 model of Charsiu [charsiu] was trained on part of our Common Voice test partition as well as the TIMIT training partition, so Charsiu's results on those datasets are unreliable upper bounds.
  • Figure 3: Crowdsourced subjective evaluation results | (top) Reconstruction quality of speech synthesized from PPGs inferred from five input representations, as well as high and low anchors. White dots are medians and black dots are means. A Wilcoxon signed-rank test gives $p=0.02$ between original speech and speech reconstructed using PPGs inferred from EnCodec. Interpretable PPGs inferred from EnCodec significantly outperform $(p<0.05)$ PPGs inferred from all other representations except Mel spectrograms ($p=0.25$).
  • Figure 4: Acoustic phoneme similarities | The entry at row $x$, column $y$ is $\mathcal{S}_{x, y} = \mathbb{E}\left[\lambda_y G_{y, t}; \lambda_x G_{x, t} \geq \lambda_z G_{z, t} \, \forall z \right]$, the average class-weighted probability assigned to phoneme $y$ when phoneme $x$ is the maximum model prediction. Averages are taken over all frames of our validation partition of Common Voice [commonvoice] using our PPG model trained with class-balancing on Mel spectrogram inputs. Red boxes show that each voiced fricative (/v/, /z/, /zh/) assigns relatively high probability to its unvoiced counterpart (/f/, /s/, /sh/), and vice versa. Class-balanced training and class-weighting are used to remove the column banding that reflects natural phoneme frequency.
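
The similarity matrix defined in the Figure 4 caption can be estimated directly from PPG frames. The following is a minimal sketch that reads the formula as "group frames by their class-weighted argmax, then average the class-weighted probabilities within each group". The function name `phoneme_similarity` and the (frames, phonemes) array layout are assumptions for illustration, and ties are resolved to the first maximal phoneme, which the caption's formula does not specify.

```python
import numpy as np


def phoneme_similarity(ppg, class_weights):
    """Estimate S[x, y] = E[lambda_y * G[y, t] | x is the weighted argmax at t].

    ppg:           (frames, phonemes) array; row t holds G[:, t].
    class_weights: (phonemes,) array of per-class weights lambda.
    Rows for phonemes that never win the argmax are left as zeros.
    """
    weighted = np.asarray(ppg, dtype=float) * np.asarray(class_weights, dtype=float)
    winners = weighted.argmax(axis=1)  # ties go to the first maximal phoneme

    num_phonemes = weighted.shape[1]
    similarity = np.zeros((num_phonemes, num_phonemes))
    for x in range(num_phonemes):
        frames = weighted[winners == x]
        if len(frames) > 0:
            similarity[x] = frames.mean(axis=0)
    return similarity
```

For example, with uniform weights and three frames whose top phonemes are 0, 0, and 1, row 0 of the result is the mean of the first two frames and row 1 equals the third frame. Large off-diagonal entries then mark phoneme pairs the model tends to confuse, such as the voiced and unvoiced fricative pairs highlighted in the figure.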