Style Mixture of Experts for Expressive Text-To-Speech Synthesis

Ahad Jawaid, Shreeram Suresh Chandra, Junchen Lu, Berrak Sisman

TL;DR

StyleMoE addresses the problem of learning averaged style representations in the style encoder by creating style experts that learn from subsets of data.

Abstract

Recent advances in style transfer text-to-speech (TTS) have improved the expressiveness of synthesized speech. However, encoding stylistic information (e.g., timbre, emotion, and prosody) from diverse and unseen reference speech remains a challenge. This paper introduces StyleMoE, an approach that addresses the issue of learning averaged style representations in the style encoder by creating style experts that learn from subsets of data. The proposed method replaces the style encoder in a TTS framework with a Mixture of Experts (MoE) layer. The style experts specialize by learning from subsets of reference speech routed to them by the gating network, enabling them to handle different aspects of the style space. As a result, StyleMoE improves the style coverage of the style encoder for style transfer TTS. Our experiments, both objective and subjective, demonstrate improved style transfer for diverse and unseen reference speech. The proposed method enhances the performance of existing state-of-the-art style transfer TTS models and represents the first study of style MoE in TTS.
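
The routing idea described in the abstract can be illustrated with a sparse top-k Mixture of Experts layer. The following is a minimal sketch under stated assumptions, not the authors' implementation: the experts are stand-in linear maps rather than style reference encoders, the gate is a single linear layer whose selected logits are renormalized with a softmax, and all names (`StyleMoESketch`, `top_k_gate`) and dimensions are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def top_k_gate(logits, k):
    """Pick the k largest gate logits and renormalize them with a softmax."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in idx])
    return idx, weights


class StyleMoESketch:
    """Toy sparse MoE layer (assumption: linear experts, linear gate)."""

    def __init__(self, in_dim, out_dim, num_experts, k, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

        self.k = k
        self.gate_w = rand_matrix(num_experts, in_dim)  # gating network
        self.experts = [rand_matrix(out_dim, in_dim) for _ in range(num_experts)]

    def forward(self, x):
        # The gate scores every expert, but only the top-k experts are evaluated.
        logits = matvec(self.gate_w, x)
        idx, weights = top_k_gate(logits, self.k)
        out = [0.0] * len(self.experts[0])
        for i, w in zip(idx, weights):
            y = matvec(self.experts[i], x)
            out = [o + w * yi for o, yi in zip(out, y)]
        return out, idx


# Hypothetical usage: 2 experts, 1 active per input.
moe = StyleMoESketch(in_dim=4, out_dim=3, num_experts=2, k=1)
style_vector, chosen = moe.forward([1.0, 0.5, -0.5, 2.0])
```

Because only the selected experts process a given input, each expert receives gradient only from the subset of reference speech routed to it, which is the mechanism the abstract credits for expert specialization.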

Paper Structure (17 sections, 4 equations, 3 figures, 1 table)

This paper contains 17 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: The architecture of StyleMoE-TTS. Red modules represent modules from GenerSpeech (Huang et al., 2022). Green modules represent the Mixture of Experts layer. Purple modules represent style experts. The darker purple modules represent the style experts chosen by the gating network. Subfigures (a) and (b) illustrate the integration of StyleMoE into StyleMoE-TTS. Subfigure (c) depicts the StyleMoE layer, wherein each style expert block is a style reference encoder. Subfigure (d) illustrates the gating network.
  • Figure 2: Style preference test on ESD reported with 95% confidence intervals.
  • Figure 3: Illustration of style expert utilization in StyleMoE layers ($N=2, k=1$). Each pie chart in a row represents a separate StyleMoE Layer across different hierarchical levels. Percentages are indicative of the style expert usage. The analysis is performed over all samples in (a) and on emotion subsets in (b), (c) and (d).
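
Usage percentages of the kind shown in Figure 3 could be obtained by counting how often the gating network selects each expert. The snippet below is a hedged sketch of that bookkeeping only: the function name `expert_usage` is hypothetical, and the routing decisions are made-up toy values, not results from the paper.

```python
from collections import Counter


def expert_usage(selected, num_experts):
    """Return the percentage of routing decisions that went to each expert."""
    counts = Counter(selected)
    total = len(selected)
    return [100.0 * counts.get(i, 0) / total for i in range(num_experts)]


# Toy routing decisions for one layer with N=2 experts and k=1 (illustrative only).
toy_decisions = [0, 1, 1, 0, 1, 1, 1, 0]
usage = expert_usage(toy_decisions, num_experts=2)
print(usage)  # [37.5, 62.5]
```

Running the same count separately on each emotion subset would give per-emotion breakdowns analogous to panels (b), (c) and (d).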