Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
Yuepeng Jiang, Tao Li, Fengyu Yang, Lei Xie, Meng Meng, Yujun Wang
TL;DR
This work tackles zero-shot speech synthesis by enhancing prosody modeling while preserving speaker timbre. It introduces a diffusion-based pitch predictor conditioned on text content and a global timbre vector $S$, plus a hierarchical prosody adaptor that operates at multiple time scales to capture both global and local prosody variations. A latent diffusion decoder and a SALN-based information injection module integrate timbre and prosody, and the model is trained on large-scale Mandarin data so that it can synthesize natural, expressive speech for unseen speakers. Experimental results show timbre quality comparable to strong baselines together with better naturalness and expressiveness. Ablations indicate that the diffusion-based pitch predictor and multi-scale prosody modeling are important for reducing jitter and improving prosodic detail.
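To make the injection step more concrete, the following is a minimal sketch of how a global vector can modulate hidden features through adaptive layer normalization. It assumes SALN refers to style-adaptive layer normalization, where a per-channel gain and bias are predicted from the conditioning vector by linear projections. All function names, shapes, and weights here are hypothetical illustrations, not the authors' implementation.

```python
import math

def layer_norm(h, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]

def linear(weights, bias, s):
    """Plain affine projection: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, s)) + b
            for row, b in zip(weights, bias)]

def saln(h, s, w_gain, b_gain, w_bias, b_bias):
    """Assumed SALN form: gain(s) * LayerNorm(h) + bias(s).

    h: hidden features for one frame (length C)
    s: global conditioning vector, e.g. a timbre embedding (length D)
    w_gain, w_bias: hypothetical C x D projection matrices
    b_gain, b_bias: hypothetical length-C offsets
    """
    normed = layer_norm(h)
    gain = linear(w_gain, b_gain, s)
    bias = linear(w_bias, b_bias, s)
    return [g * n + b for g, n, b in zip(gain, normed, bias)]

# Toy usage: 4 hidden channels conditioned on a 2-dimensional vector.
h = [1.0, 2.0, 3.0, 4.0]
s = [0.5, -0.5]
zeros = [[0.0, 0.0] for _ in range(4)]
# With zero projections, unit gain offset, and zero bias offset,
# SALN reduces to ordinary layer normalization.
out = saln(h, s, zeros, [1.0] * 4, zeros, [0.0] * 4)
```

Because the gain and bias depend only on the conditioning vector, the same vector shifts and scales every frame in the same way, which suits a global attribute such as timbre.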
Abstract
Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model, trained on large-scale datasets, that incorporates both timbre modeling and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector that models speaker timbre and also guides prosody modeling. In addition, given that prosody exhibits both global consistency and local variations, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains timbre quality comparable to the baseline but also exhibits better naturalness and expressiveness.
