Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

Kun Zhou, Shengkui Zhao, Yukun Ma, Chong Zhang, Hao Wang, Dianwen Ng, Chongjia Ni, Nguyen Trung Hieu, Jia Qi Yip, Bin Ma

TL;DR

This work addresses robustness gaps in language-model-based text-to-speech (TTS) by decoupling linguistic phonetics from fine-grained acoustics. It introduces a phonetic enhanced language modeling framework that uses phonetically rich self-supervised learning (SSL) representations as autoregressive targets, paired with a non-autoregressive module that reconstructs 8 layers of acoustic codecs from the predicted phonetic tokens. By leveraging acoustic prompts during inference, the system achieves robust zero-shot TTS, improving both objective (WER/CER) and subjective (MOS/S-MOS) metrics over a VALL-E baseline. The findings indicate that the choice of SSL model (WavLM vs. HuBERT) and the clustering granularity substantially affect performance, with WavLM and larger clusters offering notable gains in noisy or unseen conditions, which has practical implications for scalable, robust TTS systems.
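The summary above mentions that clustering granularity affects performance. As a rough illustration of what that granularity controls, the sketch below quantizes frame-level feature vectors into discrete token IDs with a small k-means codebook. It assumes the SSL tokens come from k-means over SSL features, which is common practice for HuBERT-style units but is not confirmed by this summary. The random toy features, the cluster counts (4 and 16), and all function names are hypothetical stand-ins, not the authors' setup.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, centroids):
    """Map each frame-level feature vector to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
        for f in frames
    ]


def kmeans(frames, num_clusters, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns the learned centroids (the codebook)."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, num_clusters)]
    for _ in range(iterations):
        assignments = quantize(frames, centroids)
        for k in range(num_clusters):
            members = [f for f, a in zip(frames, assignments) if a == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Toy stand-in for SSL features: 200 frames of 4-dimensional vectors (hypothetical data).
data_rng = random.Random(1)
frames = [[data_rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]

# Two codebook sizes: a coarse one and a finer one (both sizes are illustrative only).
coarse_tokens = quantize(frames, kmeans(frames, num_clusters=4))
fine_tokens = quantize(frames, kmeans(frames, num_clusters=16))

print("distinct coarse tokens:", len(set(coarse_tokens)))
print("distinct fine tokens:", len(set(fine_tokens)))
```

A larger codebook gives the autoregressive model a bigger token vocabulary in which each token covers a narrower region of feature space. Which setting works best is an empirical question that the paper's experiments address.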

Abstract

Recent language model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues due to the accumulation of errors in speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-supervised representations that are phonetically rich as the training target for the autoregressive language model. Subsequently, a non-autoregressive model is employed to predict discrete acoustic codecs that contain fine-grained acoustic details. The TTS model focuses solely on linguistic modeling during autoregressive training, thereby reducing the error propagation that occurs in non-autoregressive training. Both objective and subjective evaluations validate the effectiveness of our proposed method.

Paper Structure (14 sections, 2 equations, 1 figure, 2 tables)

This paper contains 14 sections, 2 equations, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: The overall diagram of the proposed phonetic enhanced language model (LM) based text-to-speech framework. Given input text and an acoustic prompt, the autoregressive decoder predicts self-supervised learning (SSL) tokens that contain phonetic information, and the non-autoregressive decoder then predicts 8 layers of acoustic codecs that represent fine-grained acoustic details.
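
The two-stage flow described in the Figure 1 caption can be sketched as follows. This is only an illustration of the data flow and token shapes, not the authors' models: both decoders are replaced by arbitrary placeholder arithmetic, and the vocabulary sizes, the fixed output length, and the way the acoustic prompt is represented (SSL tokens for the first stage, codec tokens for the second) are all assumptions. Only the number of codec layers (8) comes from the caption.

```python
import random

NUM_CODEC_LAYERS = 8   # from the Figure 1 caption
SSL_VOCAB = 512        # hypothetical SSL token vocabulary size
CODEC_VOCAB = 1024     # hypothetical codec vocabulary size per layer


def autoregressive_decode(text_tokens, prompt_ssl_tokens, max_len, rng):
    """Stand-in for the AR decoder: emits one SSL token per step.

    Each step sees the text, the prompt, and all previously generated tokens,
    so an early mistake can influence every later step.
    """
    generated = []
    for _ in range(max_len):
        context = text_tokens + prompt_ssl_tokens + generated
        # Placeholder for sampling from a real language model.
        next_token = (sum(context) + rng.randrange(SSL_VOCAB)) % SSL_VOCAB
        generated.append(next_token)
    return generated


def non_autoregressive_decode(ssl_tokens, prompt_codec_layers):
    """Stand-in for the NAR decoder: fills every frame of a codec layer at once.

    Within a layer, no frame depends on another predicted frame.
    """
    layers = []
    for layer_index in range(NUM_CODEC_LAYERS):
        # Placeholder for conditioning on the prompt's acoustic characteristics.
        prompt_offset = sum(prompt_codec_layers[layer_index])
        layers.append(
            [(t * (layer_index + 1) + prompt_offset) % CODEC_VOCAB for t in ssl_tokens]
        )
    return layers


rng = random.Random(0)
text_tokens = [3, 14, 15, 9]   # hypothetical phoneme or text IDs
prompt_ssl = [7, 7, 42]        # hypothetical SSL tokens of the acoustic prompt
prompt_codec = [
    [rng.randrange(CODEC_VOCAB) for _ in prompt_ssl] for _ in range(NUM_CODEC_LAYERS)
]

ssl_tokens = autoregressive_decode(text_tokens, prompt_ssl, max_len=6, rng=rng)
codec_layers = non_autoregressive_decode(ssl_tokens, prompt_codec)

print("SSL tokens:", len(ssl_tokens))
print("codec layers x frames:", len(codec_layers), "x", len(codec_layers[0]))
```

The point of the split is visible in the shapes: the sequential stage produces a single phonetic token stream, while the 8 layers of acoustic detail are produced afterwards without step-by-step sampling over frames.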