Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis
Yizhong Geng, Jizhuo Xu, Zeyu Liang, Jinghan Yang, Xiaoyi Shi, Xiaoyu Shen
TL;DR
The paper tackles the gap in high-quality TTS for under-resourced languages by introducing a data-efficient framework that combines a text-centered preprocessing pipeline with a phoneme-tone adaptive acoustic model. Using Thai as a case study, it assembles a large, multi-faceted dataset and introduces a Phoneme-Tone BERT-guided approach that captures tonal distinctions and handles grapheme-to-phoneme (G2P) ambiguities with limited data. The system achieves state-of-the-art performance on both general and domain-specific tasks and enables zero-shot voice cloning, demonstrating industrial viability in domains such as finance, healthcare, education, and law. By integrating pause prediction, robust G2P conversion, speech-aware feature extraction, and style-conditioned generation, the method offers a scalable path to multilingual TTS that can extend to other tonal, low-resource languages.
Abstract
Text-to-speech (TTS) technology has achieved impressive results for widely spoken languages, yet many under-resourced languages remain challenged by limited data and linguistic complexities. In this paper, we present a novel methodology that integrates a data-optimized framework with an advanced acoustic model to build high-quality TTS systems for low-resource scenarios. We demonstrate the effectiveness of our approach using Thai as an illustrative case, addressing both its intricate phonetic rules and its sparse resources. Our method enables zero-shot voice cloning and improves performance across diverse client applications in finance, healthcare, education, and law. Extensive subjective and objective evaluations confirm that our model achieves state-of-the-art performance, offering a scalable solution for TTS production in data-limited settings, with significant implications for broader industry adoption and multilingual accessibility.
