FleSpeech: Flexibly Controllable Speech Generation with Various Prompts
Hanzhao Li, Yuke Li, Xinsheng Wang, Jingbin Hu, Qicong Xie, Shan Yang, Lei Xie
TL;DR
FleSpeech addresses the rigidity of controllable speech generation with a multi-stage framework that accepts multimodal prompts (text, audio, and visual). A unified Multimodal Prompt Encoder (MPE) maps these prompts into a shared conditioning space, and a language-model–based semantic token predictor followed by a diffusion-based flow-matching acoustic generator synthesizes speech under that conditioning. The authors also develop a data collection pipeline for multimodal datasets to support further research. Experimental results show significant gains in naturalness, speaker similarity, and emotion/gender control, and the framework extends to speaking style editing and voice conversion. This broadens the practical capabilities of TTS systems in scenarios that require nuanced multimodal guidance and creative control.
Abstract
Controllable speech generation methods typically rely on single or fixed prompts, hindering creativity and flexibility. These limitations make it difficult to meet specific user needs in certain scenarios, such as adjusting the style while preserving a selected speaker's timbre, or choosing a style and generating a voice that matches a character's visual appearance. To overcome these challenges, we propose FleSpeech, a novel multi-stage speech generation framework that allows for more flexible manipulation of speech attributes by integrating various forms of control. FleSpeech employs a multimodal prompt encoder that processes and unifies different text, audio, and visual prompts into a cohesive representation. This approach enhances the adaptability of speech synthesis and supports creative and precise control over the generated speech. Additionally, we develop a data collection pipeline for multimodal datasets to facilitate further research and applications in this field. Comprehensive subjective and objective experiments demonstrate the effectiveness of FleSpeech. Audio samples are available at https://kkksuper.github.io/FleSpeech/
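
Illustrative sketch (not from the paper)
The abstract does not say how the multimodal prompt encoder unifies its inputs, so the following is only a toy illustration of the general idea of mapping prompts from different modalities into one shared conditioning vector. Everything in it is an assumption made for illustration: the names `SHARED_DIM`, `INPUT_DIMS`, `make_projection`, `project`, and `encode_prompts`, the feature sizes, the fixed random matrices standing in for learned projections, and the choice to average the projected prompts. It should not be read as the authors' architecture.

```python
import random
from typing import Dict, List

SHARED_DIM = 4  # hypothetical size of the shared conditioning space

# Hypothetical per-modality feature sizes (not taken from the paper).
INPUT_DIMS = {"text": 3, "audio": 5, "visual": 2}


def make_projection(in_dim: int, out_dim: int, seed: int) -> List[List[float]]:
    """Build a fixed random (out_dim x in_dim) matrix standing in for a learned projection."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix: List[List[float]], vector: List[float]) -> List[float]:
    """Multiply a matrix by a vector, checking that the dimensions agree."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("feature vector length does not match the projection")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# One projection per modality, each mapping into the same SHARED_DIM space.
PROJECTIONS = {
    name: make_projection(dim, SHARED_DIM, seed=i)
    for i, (name, dim) in enumerate(INPUT_DIMS.items())
}


def encode_prompts(prompts: Dict[str, List[float]]) -> List[float]:
    """Map any non-empty subset of modality prompts to one shared vector.

    Assumption: the prompts are combined by averaging their projections.
    """
    if not prompts:
        raise ValueError("at least one prompt is required")
    unknown = set(prompts) - set(PROJECTIONS)
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    projected = [project(PROJECTIONS[name], feats) for name, feats in prompts.items()]
    # Average column-wise so the output size is independent of how many prompts were given.
    return [sum(column) / len(projected) for column in zip(*projected)]
```

In this sketch, calling `encode_prompts({"text": [0.1, 0.2, 0.3]})` or `encode_prompts({"text": [0.1, 0.2, 0.3], "visual": [1.0, -1.0]})` returns a vector of the same length in both cases. The point is that a downstream generator can be conditioned on whichever prompt combination is supplied, which is the kind of flexibility the abstract describes.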
