Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP

Jinzuomu Zhong, Yang Li, Hui Huang, Korin Richmond, Jie Liu, Zhiba Su, Jing Guo, Benlai Tang, Fengjie Zhu

TL;DR

The paper tackles the high cost and inconsistency of manual prosody annotation for expressive TTS. It introduces a two-stage approach: contrastive pretraining on Speech-Silence and Word-Punctuation (SSWP) pairs first enriches the prosodic information in latent representations, and a multi-modal annotator then fuses text and audio latent features to predict prosodic boundaries. The approach achieves state-of-the-art results on English Prosodic Word (PW) and Prosodic Phrase (PPH) boundary annotation, with PW F1 ≈ 0.72 and PPH F1 ≈ 0.93, remains robust in data-scarce scenarios, and yields a notable subjective improvement in naturalness. By enabling accurate, scalable automatic prosody annotation, the work advances controllable TTS and points to cross-lingual extension and broader prosodic feature coverage as future directions.

Abstract

In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, this paper proposes a novel two-stage automatic annotation pipeline. In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs to enhance prosodic information in latent representations. In the second stage, we build a multi-modal prosody annotator comprising pretrained encoders, a text-speech fusion scheme, and a sequence classifier. Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance, with F1 scores of 0.72 and 0.93 for Prosodic Word and Prosodic Phrase boundaries respectively, while remaining remarkably robust to data scarcity.
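
To make the first stage more concrete, the sketch below shows one common form of contrastive objective over paired embeddings: a symmetric, CLIP-style cross-entropy over a similarity matrix. The abstract does not specify the loss, so this is an assumption for illustration only, not the paper's formulation. The function names, the `temperature` value, and the toy 2-dimensional embeddings are all hypothetical; in the paper's setting each row would stand for one Speech-Silence segment and its matching Word-Punctuation segment.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Generic CLIP-style loss over a batch of paired embeddings.

    audio_emb[i] and text_emb[i] are assumed to come from the same pair;
    every other combination in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    # Cosine similarity between every audio and every text embedding.
    logits = [
        [sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
        for ai in a
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: two pairs with hypothetical 2-dimensional embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]   # each text matches its audio
text_swapped = [[0.0, 1.0], [1.0, 0.0]]   # pairs deliberately mismatched

loss_aligned = symmetric_contrastive_loss(audio, text_aligned)
loss_swapped = symmetric_contrastive_loss(audio, text_swapped)
```

With these toy inputs the aligned batch gives a loss close to zero while the mismatched batch gives a large one, which is the behaviour a contrastive objective relies on to pull matching speech and text representations together.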

Paper Structure (14 sections, 4 equations, 1 figure, 5 tables)