Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization

Wei-Ping Huang, Sung-Feng Huang, Hung-yi Lee

TL;DR

This work tackles cross-lingual TTS when both labeled and unlabeled data from the target language are severely limited. It introduces a data-efficient transfer-learning framework that uses self-supervised features during mix pretraining and applies pseudo-label mixing during fine-tuning, so that the noisy portion of the pseudo labels is replaced with self-supervised features instead of being discarded. An embedding initialization trick based on an embedding generator further improves adaptation, especially under ultra-low-resource conditions. Across six languages, the approach synthesizes intelligible speech with as few as 4 labeled utterances and 15 minutes of unlabeled data, and it continues to outperform conventional techniques as more data becomes available.

Abstract

This paper presents an effective transfer learning framework for language adaptation in text-to-speech systems, with a focus on achieving language adaptation using minimal labeled and unlabeled data. While many works focus on reducing the usage of labeled data, very few consider minimizing the usage of unlabeled data. By utilizing self-supervised features in the pretraining stage, replacing the noisy portion of pseudo labels with these features during fine-tuning, and incorporating an embedding initialization trick, our method leverages more information from unlabeled data compared to conventional approaches. Experimental results show that our framework is able to synthesize intelligible speech in unseen languages with only 4 utterances of labeled data and 15 minutes of unlabeled data. Our methodology continues to surpass conventional techniques, even when a greater volume of data is accessible. These findings highlight the potential of our data-efficient language adaptation framework.
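
The abstract describes replacing the noisy portion of the pseudo labels with self-supervised features during fine-tuning. The snippet below is only a minimal sketch of that idea, not the paper's implementation. It assumes that each segment comes with a confidence score for its pseudo label, that a fixed threshold decides what counts as noisy, and that the self-supervised features have already been projected to the same dimension as the label embeddings. The function name `mix_pseudo_labels`, the `threshold` parameter, and the confidence scores are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mix_pseudo_labels(
    label_embeddings: Sequence[Sequence[float]],
    ssl_features: Sequence[Sequence[float]],
    confidences: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> Tuple[List[Vector], List[bool]]:
    """Swap low-confidence pseudo-label embeddings for self-supervised features.

    Each position keeps its pseudo-label embedding when the (assumed)
    confidence is at least `threshold`; otherwise the self-supervised
    feature at the same position is used instead.
    """
    if not (len(label_embeddings) == len(ssl_features) == len(confidences)):
        raise ValueError("all inputs must have the same number of segments")

    mixed: List[Vector] = []
    replaced: List[bool] = []
    for emb, feat, conf in zip(label_embeddings, ssl_features, confidences):
        if len(emb) != len(feat):
            raise ValueError("embedding and feature dimensions must match")
        use_ssl = conf < threshold
        mixed.append(list(feat) if use_ssl else list(emb))
        replaced.append(use_ssl)
    return mixed, replaced


if __name__ == "__main__":
    emb = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    ssl = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
    mixed, replaced = mix_pseudo_labels(emb, ssl, [0.9, 0.2, 0.5])
    print(mixed)
    print(replaced)
```

The point of the sketch is that uncertain positions still contribute information through their self-supervised features, whereas a filtering approach would drop them entirely.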

Paper Structure (20 sections, 4 equations, 3 figures, 5 tables)

This paper contains 20 sections, 4 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Illustration of the overall pipeline. (a) Mix pretraining. (b) Pseudo label mixing. We use $D_{source}$ for pretraining, and $D_{target}$ merged with the pseudo corpus for fine-tuning.
  • Figure 2: Illustration of the proposed pseudo label mixing methods.
  • Figure 3: CER[%] under different data settings. Each row corresponds to a different amount of $D_{target}$: from top to bottom, 4-shot, 16-shot, and 64-shot.
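
Figure 3 reports CER[%], which this summary does not define. As a rough guide, character error rate is conventionally the character-level edit distance between a reference transcript and a hypothesis, divided by the reference length. The sketch below follows that standard definition. How the paper obtains its hypotheses (for example, which recognizer transcribes the synthesized speech) and how it normalizes text are not stated here, so the function names and the example strings are illustrative only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer_percent(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer_percent("hello", "hallo"))  # one substitution over five characters
```

Because the denominator is the reference length, CER can exceed 100% when the hypothesis contains many insertions.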