An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios
Cheng Gong, Erica Cooper, Xin Wang, Chunyu Qiang, Mengzhe Geng, Dan Wells, Longbiao Wang, Jianwu Dang, Marc Tessier, Aidan Pine, Korin Richmond, Junichi Yamagishi
TL;DR
The paper studies how well an SSL-based multilingual TTS system adapts to low-resource languages when only limited data is available. It evaluates ZMM-TTS on 12 languages under varied fine-tuning configurations, using both paired and audio-only data alongside zero-shot baselines, and analyzes language similarity via ASPF. Adaptation performance depends on the phonetic similarity between the pre-training and target languages and on the language category; the fine-tuning dataset size and the number of speakers also influence adaptability. Notably, fine-tuning on paired data is not always better than fine-tuning on audio-only data. The evaluation covers intelligibility (CER), speaker similarity (SECS), language identification (LI), and predicted MOS.
Abstract
Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system proposed in our previous work. We conduct experiments on 12 languages using limited data with various fine-tuning configurations. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance. Additionally, we find that the fine-tuning dataset size and number of speakers influence adaptability. Surprisingly, we also observe that using paired data for fine-tuning is not always optimal compared to audio-only data. Beyond speech intelligibility, our analysis covers speaker similarity, language identification, and predicted MOS.
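For readers unfamiliar with the intelligibility and speaker-similarity metrics named above, the following is a minimal sketch of how a character error rate (CER) and a cosine similarity between two speaker embeddings are commonly computed. It is an illustration under stated assumptions, not the evaluation code used in the paper: the function names are hypothetical, the ASR system that would produce the hypothesis transcript and the speaker encoder that would produce the embeddings are not specified here, and the inputs in the usage lines are made-up toy values.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: strings of characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy inputs (assumptions for illustration only).
toy_cer = cer("hello", "hallo")                            # one substitution in five characters
toy_sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])        # identical toy "embeddings"
```

In practice the hypothesis string would come from an ASR system transcribing the synthesized speech, and the two vectors would come from a speaker encoder applied to the reference and synthesized audio; lower CER and higher cosine similarity are better.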
