Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

Yuepeng Jiang; Tao Li; Fengyu Yang; Lei Xie; Meng Meng; Yujun Wang

TL;DR

This work tackles zero-shot speech synthesis by improving prosody modeling while preserving speaker timbre. It introduces a diffusion-based pitch predictor conditioned on text content and a global timbre vector $S$, plus a hierarchical prosody adaptor that operates at multiple time scales to capture both global and local prosody variations. A latent diffusion decoder and a SALN-based information injection module integrate timbre and prosody, and the model is trained on large-scale Mandarin data, enabling natural and expressive speech for unseen speakers. Experiments show timbre quality comparable to the baseline together with better naturalness and expressiveness; ablations indicate that diffusion-based pitch prediction and multi-scale modeling reduce jitter and improve prosodic detail.

Abstract

Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets, including both timbre and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre while guiding prosody modeling. Besides, given that prosody contains both global consistency and local variations, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains comparable timbre quality to the baseline but also exhibits better naturalness and expressiveness.
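To make the idea of a global timbre vector steering frame-level features more concrete, here is a minimal sketch of style-adaptive layer normalization, the usual reading of "SALN". This page does not describe the paper's actual module, so everything below is an assumption for illustration: the function names (`layer_norm`, `linear`, `saln`), the list-of-lists representation, and the weight shapes are hypothetical. Each frame is normalized, then scaled and shifted by a gain and a bias predicted from the global vector.

```python
import math


def layer_norm(frame, eps=1e-5):
    """Normalize one frame (a list of floats) to zero mean and unit variance."""
    mean = sum(frame) / len(frame)
    var = sum((v - mean) ** 2 for v in frame) / len(frame)
    return [(v - mean) / math.sqrt(var + eps) for v in frame]


def linear(vec, weight, bias):
    """Affine map; `weight` holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def saln(frames, s, w_gain, b_gain, w_bias, b_bias):
    """Hypothetical SALN sketch: gain(s) * LayerNorm(frame) + bias(s).

    frames: list of frames, each a list of `channels` floats
    s:      global timbre/style vector (list of floats)
    The same gain and bias are applied to every frame, which is what
    makes the conditioning global rather than frame-specific.
    """
    gain = linear(s, w_gain, b_gain)
    bias = linear(s, w_bias, b_bias)
    return [[g * h + b for g, h, b in zip(gain, layer_norm(f), bias)]
            for f in frames]


# Usage: two frames with four channels, conditioned on a 2-dim vector.
frames = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 2.0, 1.0]]
s = [0.3, -0.7]
zeros = [[0.0, 0.0] for _ in range(4)]
# With zero projection weights, unit gain and zero bias, SALN reduces to
# plain layer normalization.
plain = saln(frames, s, zeros, [1.0] * 4, zeros, [0.0] * 4)
```

Because the gain and bias depend only on the global vector, swapping in a different speaker vector changes the scale and offset of every frame at once while the text-dependent structure in the normalized frames is left unchanged.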

Paper Structure (14 sections, 7 equations, 2 figures, 2 tables)

Introduction
Proposed Approach
Overview
Pitch Predictor Based on Diffusion
Hierarchical Prosody Adaptor
Experiments
Dataset and Preprocessing
Training
Baseline
Results
Objective Evaluation
Subjective Evaluation
Ablation Study
Conclusions

Figures (2)

Figure 1: Architecture overview of the proposed model.
Figure 2: The spectrograms of synthesized samples in hierarchical prosody modeling ablation.
