StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

Zhiyong Chen, Xinnuo Li, Zhiqi Ai, Shugong Xu

TL;DR

StyleFusion-TTS tackles zero-shot TTS with controllable style and speaker identity by integrating text prompts, audio style references, and speaker timbre references into a unified framework. It introduces a General Style Fusion encoder (GSF-enc) that disentangles and fuses style and speaker embeddings from multimodal inputs, and a Hierarchical Conformer Two-Branch Style Control Module (HC-TSCM) that fuses these controls into a VITS-based backbone. The system is trained end-to-end on multi-speaker data with LLM-augmented prompts and evaluated with both subjective and objective metrics; the reported results show superior MOS and speaker similarity while preserving naturalness. Ablation studies support the contribution of HC-TSCM and multimodal prompts, and the authors propose extending the approach to multilingual settings to broaden the applicability and expressiveness of zero-shot TTS.

Abstract

We introduce StyleFusion-TTS, a prompt- and/or audio-referenced, style- and speaker-controllable, zero-shot text-to-speech (TTS) synthesis system designed to improve on the editability and naturalness of systems in the current research literature. We propose a general front-end encoder as a compact and effective module that takes multimodal inputs, including text prompts, audio references, and speaker timbre references, in a fully zero-shot manner and produces disentangled style and speaker control embeddings. Our approach also leverages a hierarchical conformer structure to fuse the style and speaker control embeddings, aiming to achieve optimal feature fusion within a current advanced TTS architecture. StyleFusion-TTS is evaluated with multiple metrics, both subjective and objective. The system shows promising performance across our evaluations, suggesting its potential to contribute to the advancement of zero-shot text-to-speech synthesis.
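
The data flow described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the names `ControlEmbeddings`, `encode_controls`, and `fuse` are hypothetical, inputs are assumed to be already embedded as plain float vectors of equal length, and simple averaging and addition stand in for the learned encoder and the hierarchical conformer fusion.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


@dataclass
class ControlEmbeddings:
    """Two separate control signals (hypothetical container)."""
    style: Vector    # how the text is spoken
    speaker: Vector  # whose voice speaks it


def encode_controls(
    speaker_ref: Vector,
    text_prompt: Optional[Vector] = None,
    audio_style_ref: Optional[Vector] = None,
) -> ControlEmbeddings:
    """Toy stand-in for a front-end encoder.

    Style may come from a text prompt, an audio reference, or both.
    Assumption: when both are given they are simply averaged.
    """
    sources = [v for v in (text_prompt, audio_style_ref) if v is not None]
    if not sources:
        raise ValueError("need a text prompt and/or an audio style reference")
    style = [sum(values) / len(sources) for values in zip(*sources)]
    return ControlEmbeddings(style=style, speaker=list(speaker_ref))


def fuse(hidden: List[Vector], controls: ControlEmbeddings) -> List[Vector]:
    """Toy two-branch fusion over a sequence of hidden frames.

    Assumption: each branch adds one control embedding to every frame,
    and the two branch outputs are averaged. A real system would use
    learned layers here.
    """
    fused = []
    for frame in hidden:
        style_branch = [h + s for h, s in zip(frame, controls.style)]
        speaker_branch = [h + p for h, p in zip(frame, controls.speaker)]
        fused.append([(a + b) / 2 for a, b in zip(style_branch, speaker_branch)])
    return fused


controls = encode_controls(
    speaker_ref=[1.0, 0.0],
    text_prompt=[0.0, 2.0],
    audio_style_ref=[2.0, 0.0],
)
fused_frames = fuse([[0.0, 0.0]], controls)
```

Keeping style and speaker in separate fields mirrors the idea of disentangled controls: either one can be swapped without touching the other before fusion.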

Paper Structure

This paper contains 14 sections, 6 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Model overview for StyleFusion-TTS
  • Figure 2: Front-end general style fusion encoder (GSF-enc) for speaker and style representation and disentanglement
  • Figure 3: Hierarchical conformer TSCM (HC-TSCM) for control-fusion
  • Figure 4: Style-control prompt generation pipeline with LLM
  • Figure 5: (a)-(b) The style and speaker embeddings of GSF-enc
  • ...and 1 more figure