ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis

Xiangheng He, Junjie Chen, Zixing Zhang, Björn W. Schuller

TL;DR

ProsodyFM tackles two critical gaps in TTS prosody, phrasing and intonation, by introducing a flow-matching-based architecture with four specialized modules: Pitch Processor, Phrase Break Encoder, Text-Pitch Aligner, and Terminal Intonation Encoder. Trained in an unsupervised, conditional fashion via OT-CFM, it learns to infer flexible phrase breaks and robust intonation shapes without explicit prosodic labels, while maintaining strong intelligibility on unseen sentences and speakers. Objective metrics ($\mathrm{RMSE}_{f0}$, $\mathrm{F1}_{break}$, WER) and subjective MOS scores show that ProsodyFM outperforms four SOTA baselines, and ablations confirm the complementary roles of the break and intonation components. A case study demonstrates fine-grained control over phrasing and intonation, and out-of-distribution experiments show that the model generalizes to unseen complex sentences and speakers.

Abstract

Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder which learns a bank of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels and yet can uncover a broad spectrum of break durations and intonation patterns. Experimental results demonstrate that ProsodyFM can effectively improve the phrasing and intonation aspects of prosody, thereby enhancing the overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement can further bring ProsodyFM superior generalizability for unseen complex sentences and speakers. Our case study intuitively illustrates the powerful and fine-grained controllability of ProsodyFM over phrasing and intonation.
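
For readers unfamiliar with the flow-matching backbone mentioned above, the following is a minimal sketch of the standard OT-CFM (optimal-transport conditional flow matching) training target that such backbones regress. It is not taken from the authors' code: the function names, the `sigma_min` value, and the tensor shapes are assumptions chosen for illustration, and the zero "prediction" merely stands in for a conditioned network output.

```python
import numpy as np


def ot_cfm_pair(x0, x1, t, sigma_min=1e-4):
    """Return (x_t, u_t) for the straight-line OT-CFM path.

    x0: noise sample, x1: data sample (same shape, batch first),
    t:  one time value in [0, 1] per batch item.
    sigma_min is an assumed small constant, not a value from the paper.
    """
    # Reshape t so it broadcasts over all non-batch dimensions.
    t = np.asarray(t, dtype=float).reshape(-1, *([1] * (x1.ndim - 1)))
    # Point on the path between noise (t=0) and data (t=1).
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Target velocity; constant in t for this path.
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t


def ot_cfm_loss(v_pred, u_t):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - u_t) ** 2))


rng = np.random.default_rng(0)
# Hypothetical shapes: (batch, mel bins, frames).
x1 = rng.normal(size=(2, 80, 50))   # stand-in for target mel-spectrograms
x0 = rng.normal(size=x1.shape)      # Gaussian noise
t = rng.uniform(size=2)             # one random time step per batch item

x_t, u_t = ot_cfm_pair(x0, x1, t)
# A real model would predict the velocity from x_t, t, and its conditioning
# (text, prosody features); zeros are used here only as a placeholder.
loss = ot_cfm_loss(np.zeros_like(u_t), u_t)
```

In an actual system the placeholder prediction would be replaced by the decoder's output, conditioned on text and whatever prosody features the model provides.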

Paper Structure

This paper contains 38 sections, 10 equations, 35 figures, 5 tables, 1 algorithm.

Figures (35)

  • Figure 1: Pitch contours extracted from 5 pitch tracking methods (blue) and our pitch smoothing method (orange).
  • Figure 2: The model architecture of the proposed ProsodyFM during training. The components outlined by the yellow shaded area are unique to ProsodyFM and differ from those in MatchaTTS.
  • Figure 3: The key components of the proposed ProsodyFM in the training (a) and inference (b) phases. The red markings highlight the differences. The snowflake mark means the module is frozen during training.
  • Figure 9: The instruction page of our Mean Opinion Score human listening test.
  • Figure 10: The labeled transcripts provided in the human listening test for all 15 testing samples under parallel and non-parallel settings.
  • ...and 30 more figures