ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis

Dehua Tao, Daxin Tan, Yu Ting Yeung, Xiao Chen, Tan Lee

TL;DR

ToneUnit addresses the tone shift problem in Mandarin Chinese speech synthesis by learning tone-aware discrete speech units through CTC supervision on tone-labeled data. It combines an SSL-based speech encoder (SPIRAL), a quantizer (VQ or FSQ), and a CTC decoder, with a VITS-based synthesizer decoding discrete units into waveform; FSQ emerges as the more effective quantization method, offering higher codeword usage and better intelligibility, even with limited labeled data. The results show that ToneUnit improves Mandarin synthesis and preserves competitive English performance, while revealing that discrete units can encode tonal information by mapping different tones of a phoneme to distinct unit sets. This work suggests ToneUnit as a bridge between speech representations and downstream language models, with potential for multimodal LLM integration.

Abstract

Representing speech as discretized units has numerous benefits in supporting downstream spoken language processing tasks. However, the approach has been less explored in speech synthesis of tonal languages like Mandarin Chinese. Our preliminary experiments on Chinese speech synthesis reveal the issue of "tone shift", where a synthesized speech utterance contains correct base syllables but incorrect tones. To address the issue, we propose the ToneUnit framework, which leverages annotated data with tone labels as CTC supervision to learn tone-aware discrete speech units for Mandarin Chinese speech. Our findings indicate that the discrete units acquired through ToneUnit resolve the "tone shift" issue in synthesized Chinese speech and yield favorable results in English synthesis. Moreover, the experimental results suggest that finite scalar quantization enhances the effectiveness of ToneUnit. Notably, ToneUnit can work effectively even with minimal annotated data.
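
To make the finite scalar quantization (FSQ) step mentioned above more concrete, here is a minimal NumPy sketch of the general FSQ idea: each encoder dimension is bounded with tanh, rounded to a small set of integer levels, and the resulting per-dimension codes are combined into a single discrete unit ID. The function names, the level counts (3, 5, 5), and the restriction to odd level counts are assumptions made for illustration; they are not the configuration used in the paper, and the training-time straight-through gradient is omitted.

```python
import numpy as np


def fsq_quantize(z, levels):
    """Round each dimension of z to one of levels[d] integer values.

    z: array of shape (T, D), e.g. per-frame encoder outputs.
    levels: number of quantization levels per dimension (odd, illustrative).
    Returns integer codes in [-(L-1)/2, (L-1)/2] for each dimension.
    """
    levels = np.asarray(levels)
    assert np.all(levels % 2 == 1), "this sketch handles odd level counts only"
    half = (levels - 1) / 2
    bounded = np.tanh(z) * half  # squash each dimension into [-half, half]
    return np.round(bounded).astype(int)


def codes_to_unit_ids(codes, levels):
    """Combine per-dimension codes into one unit ID via mixed-radix encoding."""
    levels = np.asarray(levels)
    digits = codes + (levels - 1) // 2  # shift each code to 0..L-1
    bases = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * bases).sum(axis=-1)


def codeword_usage(unit_ids, levels):
    """Fraction of the implicit codebook that appears in unit_ids."""
    return len(np.unique(unit_ids)) / int(np.prod(levels))


if __name__ == "__main__":
    levels = [3, 5, 5]  # hypothetical: implicit codebook of 3 * 5 * 5 = 75 units
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(200, len(levels)))  # stand-in for encoder outputs
    ids = codes_to_unit_ids(fsq_quantize(frames, levels), levels)
    print(ids[:10], codeword_usage(ids, levels))
```

Because the codebook here is an implicit grid rather than a learned table, every codeword is reachable by construction, which is one intuition for why FSQ tends to show higher codeword usage than learned VQ codebooks.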

Paper Structure (22 sections, 1 equation, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Pitch distribution for four tones of phoneme /i/ in Mandarin Chinese speech. The x-axis represents fundamental frequency, and the y-axis is the number of occurrences.
  • Figure 2: Components of the ToneUnit. The speech synthesizer is trained independently on discrete speech units generated by ToneUnit.