An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi

TL;DR

The paper studies how well an SSL-based multilingual TTS system adapts to low-resource languages when only limited data is available. It evaluates ZMM-TTS on 12 languages under a range of fine-tuning configurations, using paired (text and audio) data and audio-only data alongside zero-shot baselines, and relates the results to language similarity measured with ASPF. Adaptation performance correlates with the phonetic similarity between the pre-training and target languages and with language category. Fine-tuning dataset size and number of speakers also affect outcomes, and fine-tuning on paired data is not always better than fine-tuning on audio-only data. The evaluation covers intelligibility (CER), speaker similarity (SECS), language identification (LI), and predicted MOS.

Abstract

Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conducted experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observed that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.

Paper Structure (12 sections, 1 equation, 2 figures, 2 tables)

Figures (2)

  • Figure 1: CER and ASPF values for different languages. The blue dashed line shows the best CER achieved by synthesized audio across the 25 configurations, and the solid line shows the CER of natural audio. The purple dashed line shows, for each language, the ASPF value with respect to the most similar of the 6 pre-training languages, together with that language.
  • Figure 2: CER results for different languages under various fine-tuning methods. $\checkmark$ marks the best result for each language, and $\times$ marks the worst.
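
Both figures report CER (character error rate) as the intelligibility measure. As a rough illustration of the metric itself, the sketch below computes CER as the character-level edit distance between a reference text and a hypothesis text, divided by the reference length. It assumes the hypothesis is an ASR transcript of the synthesized audio; the function names `edit_distance` and `cer` are illustrative, and the ASR model and text normalization used in the paper are not specified here, so this is not the authors' evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of five reference characters.
    print(cer("hello", "hallo"))
```

Because the distance is normalized by the reference length, CER can exceed 1.0 when the hypothesis contains many insertions. In practice, both strings are usually normalized (case, punctuation, whitespace) before scoring, and that choice affects the reported numbers.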