Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning

Xinfa Zhu, Yuke Li, Yi Lei, Ning Jiang, Guoqing Zhao, Lei Xie

TL;DR

This work tackles multi-speaker expressive TTS by learning disentangled style, emotion, and speaker representations with a two-level contrastive learning framework and a semi-supervised training strategy. The Speech Representation Learning (SRL) module is trained with utterance-level and category-level positives and with mutual-information minimization via vCLUB. Its representations condition an enhanced VITS model equipped with a flow-based prosody adaptor, which synthesizes speech in diverse styles and emotions for a target speaker, including in cross-lingual scenarios. In monolingual and multilingual experiments, the method outperforms baselines in naturalness and in style, emotion, and speaker similarity, and ablations confirm the benefits of contrastive learning and semi-supervised data. The approach also transfers style and emotion across languages, enabling expressive synthesis even when target styles or emotions are outside the training data, which makes it promising for flexible, high-quality TTS deployment.

Abstract

This paper aims to build a multi-speaker expressive TTS system, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, contrastive learning from different levels, i.e. utterance and category level, is leveraged to extract the disentangled style, emotion, and speaker representations from speech for style and emotion transfer. Furthermore, a semi-supervised training strategy is introduced to improve the data utilization efficiency by involving multi-domain data, including style-labeled data, emotion-labeled data, and abundant unlabeled data. To achieve expressive speech with diverse styles and emotions for a target speaker, the learned disentangled representations are integrated into an improved VITS model. Experiments on multi-domain data demonstrate the effectiveness of the proposed method.
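
As a rough illustration of the two contrastive levels mentioned above, the sketch below computes a generic InfoNCE-style loss in which positives are defined by a shared group id: an utterance id for the utterance level, or a style/emotion label for the category level. It is a minimal sketch under stated assumptions, not the paper's loss: the function names, the cosine similarity, the temperature value, the toy embeddings, and the unweighted sum of the two terms are all hypothetical, and the vCLUB mutual-information term is not modeled. Unlabeled items carry `None` as their category, so they contribute only through the utterance-level term, which is one plausible reading of the semi-supervised setup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(embeddings, group_ids, temperature=0.1):
    """InfoNCE-style loss; positives are other items with the same group id.

    Items whose group id is None (assumed here to mean "unlabeled") are
    never anchors, but they still act as negatives for the other items.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        if group_ids[i] is None:
            continue
        positives = [j for j in range(n)
                     if j != i and group_ids[j] == group_ids[i]]
        if not positives:
            continue
        # Temperature-scaled similarities to every other item in the batch.
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability over this anchor's positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two segments from each of two utterances (hypothetical values).
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
utt_ids = [0, 0, 1, 1]                    # utterance-level positives
emotions = ["happy", "happy", None, None]  # second utterance is unlabeled

utterance_term = contrastive_loss(emb, utt_ids)
category_term = contrastive_loss(emb, emotions)
combined = utterance_term + category_term  # unweighted sum, an assumption

# Labels that contradict the embedding geometry give a larger loss.
mismatched = contrastive_loss(emb, ["happy", "sad", "happy", "sad"])
```

In this toy batch the loss is small when items sharing a group id are already close in embedding space, and it grows when the grouping contradicts the geometry, which is the pressure that pulls same-category representations together during training.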

Paper Structure

This paper contains 15 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: The architecture of speech representation learning module.
  • Figure 2: The architecture of multi-speaker expressive VITS.
  • Figure 3: t-SNE visualization of style representations (top) and emotion representations (bottom). Points are colored by style or emotion category (left) and by speaker (right).
  • Figure 4: t-SNE visualization of style representations (left) and emotion representations (right) in multilingual settings.