Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis

Yizhong Geng, Jizhuo Xu, Zeyu Liang, Jinghan Yang, Xiaoyi Shi, Xiaoyu Shen

TL;DR

The paper tackles the gap in high-quality TTS for under-resourced languages by introducing a data-efficient framework that combines a text-centered preprocessing pipeline with a phoneme-tone adaptive acoustic model. Using Thai as a case study, it assembles a large, multi-faceted dataset and a novel Phoneme-Tone BERT-guided approach to capture tonal distinctions and grapheme-to-phoneme ambiguities with limited data. The system achieves state-of-the-art performance on both general and domain-specific tasks and enables zero-shot voice cloning, demonstrating industrial viability for domains like finance, healthcare, education, and law. By integrating pause prediction, robust G2P, speech-aware feature extraction, and style-conditioned generation, the method offers a scalable path to multilingual TTS that can extend to other tonal, low-resource languages.

Abstract

Text-to-speech (TTS) technology has achieved impressive results for widely spoken languages, yet many under-resourced languages remain challenged by limited data and linguistic complexities. In this paper, we present a novel methodology that integrates a data-optimized framework with an advanced acoustic model to build high-quality TTS systems for low-resource scenarios. We demonstrate the effectiveness of our approach using Thai as an illustrative case, where intricate phonetic rules and sparse resources are effectively addressed. Our method enables zero-shot voice cloning and improved performance across diverse client applications, ranging from finance to healthcare, education, and law. Extensive evaluations - both subjective and objective - confirm that our model meets state-of-the-art standards, offering a scalable solution for TTS production in data-limited settings, with significant implications for broader industry adoption and multilingual accessibility.

Paper Structure

This paper contains 30 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Overview of the data-optimized framework combined with the advanced acoustic model. The architecture comprises two components: (1) the preprocessing pipeline (LLM → Tokenizer → grapheme-to-phoneme (G2P)), which converts raw text to phoneme-tone sequences; and (2) the TTS model, in which the Phoneme-Tone BERT module refines contextual pronunciation using text corpus inputs and is integrated with acoustic modeling for speech synthesis.
  • Figure 2: Overview of the proposed TTS model, comprising audio feature extractors, a GAN-based decoder, and a prediction module. The diagram illustrates the different training stages.
  • Figure 3: Spectrogram comparison illustrating pause alignment across different TTS systems. The red bounding boxes highlight detected pause regions.
  • Figure 4: t-SNE visualization of speaker embeddings extracted from the synthesized speech. Each point represents a speaker embedding, and distinct clusters show that our zero-shot TTS preserves speaker identity.
  • Figure 5: