Assessing the Ability of Neural TTS Systems to Model Consonant-Induced F0 Perturbation

Tianle Yang; Chengzhe Sun; Phil Rose; Cassandra L. Jacobs; Siwei Lyu

Abstract

This study proposes a segmental-level prosodic probing framework to evaluate neural TTS models' ability to reproduce consonant-induced f0 perturbation, a fine-grained segmental-prosodic effect that reflects local articulatory mechanisms. We compare synthetic and natural speech realizations for thousands of words, stratified by lexical frequency, using Tacotron 2 and FastSpeech 2 trained on the same speech corpus (LJ Speech). These controlled analyses are then complemented by a large-scale evaluation spanning multiple advanced TTS systems. Results show accurate reproduction for high-frequency words but poor generalization to low-frequency items, suggesting that the examined TTS architectures rely more on lexical-level memorization than on abstract segmental-prosodic encoding. This finding highlights a limitation in such TTS systems' ability to generalize prosodic detail beyond seen data. The proposed probe offers a linguistically informed diagnostic framework that may inform future TTS evaluation methods, and has implications for interpretability and authenticity assessment in synthetic speech.

Paper Structure (20 sections, 5 figures, 2 tables)

Introduction
Background
Previous work on F0 perturbation
Comparison Baseline for F0 perturbation
Comparing different model structure
Implication
Experimental Setting
Speech generating and alignment
Acoustic extraction
Statistical modeling
Additional settings for Experiment 2
GAMM Structure and Implementation
Corpus analysis
Analysis and results (Experiment 1)
Direct comparison of the training exposure
...and 5 more sections

Figures (5)

Figure 1: GAMM smooths of $f_{0}$ trajectories for onset types (leftmost column) and pairwise difference smooths (right three columns) for all high-frequency speech sources. Shaded areas indicate significant intervals.
Figure 2: GAMM smooths of $f_{0}$ trajectories for onset types (leftmost column) and pairwise difference smooths (right three columns) for all low-frequency speech sources. Shaded areas indicate significant intervals.
Figure 3: GAMM smooths and difference smooths for $z$-scored $f_{0}$ trajectories (seen vs unseen).
Figure 4: GAMM smooths and difference smooths for $z$-scored $f_{0}$ trajectories (high frequency group).
Figure 5: GAMM smooths and difference smooths for $z$-scored $f_{0}$ trajectories (low frequency group).
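
Figures 3 to 5 refer to $z$-scored $f_{0}$ trajectories. The excerpt here does not say over which unit the normalization is computed (per speaker, per speech source, or per token), so the following is only a minimal sketch that assumes z-scoring within each speech source. The function name `zscore_f0` and the dictionary layout are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def zscore_f0(f0_hz_by_source):
    """Z-score f0 samples (in Hz) separately within each speech source.

    Assumption (not confirmed by the paper): normalization is done per
    source, using the population standard deviation of that source's samples.
    """
    normalized = {}
    for source, values in f0_hz_by_source.items():
        mu = mean(values)
        sigma = pstdev(values)
        if sigma == 0:
            # A flat contour has no spread, so a z-score is undefined.
            raise ValueError(f"no f0 variation for source {source!r}")
        normalized[source] = [(v - mu) / sigma for v in values]
    return normalized


# Illustrative values only; these are not measurements from the study.
example = zscore_f0({"natural": [100.0, 120.0, 140.0]})
```

After this transformation each source has mean 0 and unit standard deviation, which makes trajectories from sources with different pitch ranges comparable on one scale.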
