ToneUnit: A Speech Discretization Approach for Tonal Language Speech Synthesis
Dehua Tao, Daxin Tan, Yu Ting Yeung, Xiao Chen, Tan Lee
TL;DR
ToneUnit addresses the tone shift problem in Mandarin Chinese speech synthesis by learning tone-aware discrete speech units through CTC supervision on tone-labeled data. It combines an SSL-based speech encoder (SPIRAL), a quantizer (VQ or FSQ), and a CTC decoder, and a VITS-based synthesizer decodes the discrete units into waveforms. FSQ emerges as the more effective quantization method, offering higher codeword usage and better intelligibility, even with limited labeled data. The results show that ToneUnit improves Mandarin synthesis while remaining competitive on English, and that discrete units can encode tonal information by mapping different tones of a phoneme to distinct sets of units. This work suggests ToneUnit as a bridge between speech representations and downstream language models, with potential for multimodal LLM integration.
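To make the quantization step concrete, the following is a minimal sketch of finite scalar quantization (FSQ) in NumPy, not the authors' implementation. The per-dimension level counts in `LEVELS` and the function names are hypothetical choices for illustration, and only odd level counts are handled to keep the rounding symmetric. Each latent dimension is squashed with `tanh`, rounded to one of a few integer levels, and the resulting code vector is mapped to a single unit index. Because the codebook is an implicit grid rather than a learned table, every index is reachable by construction, which is one intuition for why FSQ tends to show high codeword usage.

```python
import numpy as np

# Hypothetical number of quantization levels per latent dimension (odd values only).
LEVELS = np.array([5, 5, 5])


def fsq_quantize(z, levels=LEVELS):
    """Bound each dimension with tanh, then round to the nearest integer level.

    For L levels the output lies in {-(L-1)/2, ..., (L-1)/2}.
    """
    half = (levels - 1) / 2.0
    bounded = np.tanh(z) * half
    return np.round(bounded).astype(np.int64)


def fsq_index(codes, levels=LEVELS):
    """Map integer code vectors to a single discrete unit index (mixed radix)."""
    half = (levels - 1) // 2
    digits = codes + half  # shift each dimension to the range 0..L-1
    bases = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * bases).sum(axis=-1)


if __name__ == "__main__":
    # Three example frames of a 3-dimensional encoder output (made-up values).
    frames = np.array([[0.0, 0.0, 0.0],
                       [10.0, 10.0, 10.0],
                       [-10.0, -10.0, -10.0]])
    codes = fsq_quantize(frames)
    print(codes)
    print(fsq_index(codes))
    print("implicit codebook size:", int(np.prod(LEVELS)))
```

In a trained system the rounding would be paired with a straight-through gradient estimator so the encoder can be optimized end to end; that part is omitted here.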
Abstract
Representing speech as discretized units has numerous benefits in supporting downstream spoken language processing tasks. However, the approach has been less explored in speech synthesis of tonal languages like Mandarin Chinese. Our preliminary experiments on Chinese speech synthesis reveal the issue of "tone shift", where a synthesized speech utterance contains correct base syllables but incorrect tones. To address the issue, we propose the ToneUnit framework, which leverages annotated data with tone labels as CTC supervision to learn tone-aware discrete speech units for Mandarin Chinese speech. Our findings indicate that the discrete units learned by ToneUnit resolve the "tone shift" issue in synthesized Chinese speech and yield favorable results in English synthesis. Moreover, the experimental results suggest that finite scalar quantization enhances the effectiveness of ToneUnit. Notably, ToneUnit can work effectively even with minimal annotated data.
