DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models
Weihao Wu, Zhiwei Lin, Yixuan Zhou, Jingbei Li, Rui Niu, Qinghua Wu, Songjun Cao, Long Ma, Zhiyong Wu
TL;DR
This work tackles the limitations of deterministic prosody and traditional TTS in conversational speech synthesis by introducing DiffCSS, a framework that blends diffusion-based prosody generation with a language-model-backed TTS backbone. It uses a diffusion-based context-aware prosody predictor to sample diverse, contextually appropriate embeddings from multimodal conversational context, which are then consumed by a prosody-enhanced ParlerTTS backbone to synthesize speech. The key contributions include the first application of diffusion models to CSS for prosody, a two-stage training strategy, and a cross-attention mechanism to derive fixed-length prosody embeddings. Experimental results on LibriTTS-R and DailyTalk show improvements in expressiveness, coherence, and prosody diversity over strong baselines, highlighting DiffCSS's potential for more natural and varied conversational speech.
Abstract
Conversational speech synthesis (CSS) aims to synthesize speech that is both contextually appropriate and expressive, and considerable effort has gone into improving the understanding of conversational context. However, existing CSS systems are limited to deterministic prediction, overlooking the diversity of potential responses. Moreover, they rarely employ language model (LM)-based TTS backbones, which limits the naturalness and quality of the synthesized speech. To address these issues, we propose DiffCSS, a CSS framework that leverages diffusion models and an LM-based TTS backbone to generate diverse, expressive, and contextually coherent speech. A diffusion-based context-aware prosody predictor samples diverse prosody embeddings conditioned on multimodal conversational context. A prosody-controllable LM-based TTS backbone then synthesizes high-quality speech from the sampled prosody embeddings. Experimental results demonstrate that speech synthesized by DiffCSS is more diverse, contextually coherent, and expressive than that of existing CSS systems.
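To illustrate why a diffusion-based predictor yields diverse outputs where a deterministic one cannot, the following is a minimal sketch of standard DDPM-style ancestral sampling of a fixed-length embedding conditioned on a context vector. It is not the DiffCSS implementation: the function names, the linear noise schedule, the step count, and the closed-form `toy_denoiser` (which stands in for a trained, context-conditioned network by assuming the target embeddings are Gaussian around the context vector) are all assumptions made for illustration.

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice; the paper's schedule may differ)."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a              # cumulative product of alphas
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def toy_denoiser(x_t, alpha_bar, context, spread):
    """Hypothetical stand-in for a trained noise-prediction network.

    Assumes clean embeddings follow N(context, spread^2 I), for which the
    optimal noise estimate E[eps | x_t] has this closed form.
    """
    var_t = alpha_bar * spread ** 2 + (1.0 - alpha_bar)
    scale = math.sqrt(1.0 - alpha_bar) / var_t
    root_ab = math.sqrt(alpha_bar)
    return [scale * (xi - root_ab * ci) for xi, ci in zip(x_t, context)]


def sample_prosody_embedding(context, num_steps=50, spread=0.5, seed=None):
    """Draw one embedding by running the reverse diffusion process."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    # Start from pure Gaussian noise with the same length as the context.
    x = [rng.gauss(0.0, 1.0) for _ in context]
    for t in reversed(range(num_steps)):
        eps_hat = toy_denoiser(x, alpha_bars[t], context, spread)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t])
                for xi, ei in zip(x, eps_hat)]
        if t > 0:
            # Fresh noise at every step except the last is the source of diversity.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


if __name__ == "__main__":
    context = [0.2, -0.1, 0.4, 0.0]   # hypothetical context encoding
    for seed in (1, 2, 3):
        print(sample_prosody_embedding(context, seed=seed))
```

Running the sampler with different seeds on the same context produces different embeddings, whereas a deterministic predictor would map that context to a single point. In the framework described above, each sampled embedding would then condition the TTS backbone.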
