FleSpeech: Flexibly Controllable Speech Generation with Various Prompts

Hanzhao Li, Yuke Li, Xinsheng Wang, Jingbin Hu, Qicong Xie, Shan Yang, Lei Xie

TL;DR

FleSpeech tackles the rigidity of controllable speech generation by introducing a multi-stage framework that accepts multimodal prompts (text, audio, and visuals). A unified Multimodal Prompt Encoder (MPE) maps diverse prompts into a shared conditioning space, while a language-model–based semantic token predictor and a diffusion-based flow-matching acoustic generator synthesize speech with flexible control. The authors also develop a multimodal data collection pipeline to enable future multimodal research. Experimental results show significant gains in naturalness, speaker similarity, and emotion/gender control, with demonstrated extensions to speaking style editing and voice conversion. This approach broadens the practical capabilities of TTS systems in scenarios requiring nuanced, multimodal guidance and creative control.

Abstract

Controllable speech generation methods typically rely on single or fixed prompts, hindering creativity and flexibility. These limitations make it difficult to meet specific user needs in certain scenarios, such as adjusting the style while preserving a selected speaker's timbre, or choosing a style and generating a voice that matches a character's visual appearance. To overcome these challenges, we propose FleSpeech, a novel multi-stage speech generation framework that allows for more flexible manipulation of speech attributes by integrating various forms of control. FleSpeech employs a multimodal prompt encoder that processes and unifies different text, audio, and visual prompts into a cohesive representation. This approach enhances the adaptability of speech synthesis and supports creative and precise control over the generated speech. Additionally, we develop a data collection pipeline for multimodal datasets to facilitate further research and applications in this field. Comprehensive subjective and objective experiments demonstrate the effectiveness of FleSpeech. Audio samples are available at https://kkksuper.github.io/FleSpeech/

Paper Structure

This paper contains 34 sections, 2 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: FleSpeech can flexibly generate speech that matches the given prompts.
  • Figure 2: The model architecture of FleSpeech.
  • Figure 3: Cosine similarity matrix of speaker embeddings between face-prompt-based synthesized speech and ground-truth speech. The horizontal axis represents different synthesized speech, while the vertical axis represents ground-truth speech. The diagonal indicates that the image prompt and ground-truth speech are from the same speaker. Lighter colors indicate higher similarity.
  • Figure 4: Fundamental frequency (F0) curves of speech for different age and BMI groups, split by gender.
  • Figure 5: MPE in language model
  • ...and 1 more figure
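
The Figure 3 caption describes a cosine similarity matrix between speaker embeddings of ground-truth speech (rows) and face-prompt-based synthesized speech (columns). The following is a minimal sketch of how such a matrix could be computed, not the authors' evaluation code. The function names and the toy 3-dimensional embeddings are hypothetical; real speaker embeddings would come from a speaker verification model that this page does not name.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(ground_truth, synthesized):
    """Rows index ground-truth speakers, columns index synthesized utterances."""
    return [[cosine_similarity(g, s) for s in synthesized] for g in ground_truth]

# Toy embeddings (assumed values, for illustration only).
# Entry i of each list is meant to belong to the same speaker.
ground_truth = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
synthesized = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]

matrix = similarity_matrix(ground_truth, synthesized)
for row in matrix:
    print(["%.3f" % value for value in row])
```

If speech generated from a face prompt matches the target speaker, the diagonal entries of the matrix should be larger than the off-diagonal entries in the same row, which is the pattern the caption describes.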