Assessing the Ability of Neural TTS Systems to Model Consonant-Induced F0 Perturbation

Tianle Yang, Chengzhe Sun, Phil Rose, Cassandra L. Jacobs, Siwei Lyu

Abstract

This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models' ability to reproduce consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems' ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.

Paper Structure (20 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: GAMM smooths of $f_{0}$ trajectories for onset types (leftmost column) and pairwise difference smooths (right three columns) for all high-frequency speech sources. Shaded areas indicate significant intervals.
  • Figure 2: GAMM smooths of $f_{0}$ trajectories for onset types (leftmost column) and pairwise difference smooths (right three columns) for all low-frequency speech sources. Shaded areas indicate significant intervals.
  • Figure 3: GAMM smooths and difference smooths for $z$-scored $f_{0}$ trajectories (seen vs unseen).
  • Figure 4: GAMM smooths and difference smooths for $z$-scored $f_{0}$ trajectories (high frequency group).
  • Figure 5: GAMM smooths and difference smooths for $z$-scored $f_{0}$ trajectories (low frequency group).
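
To make the $z$-scored $f_{0}$ comparison named in Figures 3-5 concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each token has already been reduced to a short list of $f_{0}$ values (Hz) at time-normalized points, $z$-scores them within each speech source, and takes a pointwise mean difference between two onset types. The pointwise mean is a crude stand-in for the GAMM difference smooths used in the paper; the field names (`source`, `onset`, `f0`), the function names, and the toy values are all hypothetical.

```python
from statistics import mean, pstdev


def zscore_by_source(tracks):
    """Add a 'z' trajectory to each track, z-scored within its speech source.

    tracks: list of dicts with hypothetical keys
      'source' (e.g. 'natural'), 'onset' (e.g. 'voiceless'),
      'f0' (list of Hz values at time-normalized points).
    """
    # Pool every f0 sample belonging to the same source.
    pooled = {}
    for t in tracks:
        pooled.setdefault(t["source"], []).extend(t["f0"])
    # Population mean and standard deviation per source.
    stats = {s: (mean(v), pstdev(v)) for s, v in pooled.items()}
    out = []
    for t in tracks:
        mu, sd = stats[t["source"]]
        out.append({**t, "z": [(x - mu) / sd for x in t["f0"]]})
    return out


def mean_trajectory(tracks, source, onset):
    """Pointwise mean z-scored trajectory for one source and onset type."""
    rows = [t["z"] for t in tracks
            if t["source"] == source and t["onset"] == onset]
    return [mean(col) for col in zip(*rows)]


def difference_curve(tracks, source, onset_a, onset_b):
    """Pointwise difference (onset_a minus onset_b) of mean trajectories."""
    a = mean_trajectory(tracks, source, onset_a)
    b = mean_trajectory(tracks, source, onset_b)
    return [x - y for x, y in zip(a, b)]


# Toy data (invented for illustration only): f0 starts higher after a
# voiceless onset and the two trajectories converge later in the vowel.
toy = zscore_by_source([
    {"source": "natural", "onset": "voiceless", "f0": [220.0, 210.0, 205.0]},
    {"source": "natural", "onset": "voiced", "f0": [195.0, 200.0, 204.0]},
])
diff = difference_curve(toy, "natural", "voiceless", "voiced")
```

In this toy case `diff` is largest at the first time point and shrinks toward the last, which is the shape a perturbation effect would take. A real analysis would replace the pointwise mean with a fitted smooth and a significance test, as the figure captions indicate.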