The Codec Language Model-based Zero-Shot Spontaneous Style TTS System for CoVoC Challenge 2024
Shuoyi Zhou, Yixuan Zhou, Weiqin Li, Jun Chen, Runchuan Ye, Weihao Wu, Zijian Lin, Shun Lei, Zhiyong Wu
TL;DR
This work presents a zero-shot spontaneous style TTS system for CoVoC 2024, built on a LLaMA-based codec language model with a delay-pattern autoregressive framework and augmented with Classifier-Free Guidance (CFG) to improve intelligibility. The model uses MT5-based text embeddings, HuBERT+K-means semantic tokens, and DAC-based acoustic tokens. It is pretrained in stages on large corpora and then fine-tuned on high-quality spontaneous data (HQ-Conversations) plus premium WenetSpeech4TTS data. Data preprocessing removes noise and overlapping speech, and CFG at inference time blends conditional and unconditional predictions to improve synthesis quality. In the official CoVoC constrained-track evaluation, the system achieves the best naturalness MOS ($3.80$), along with strong speech quality, speaker similarity, and robustness, demonstrating effective zero-shot spontaneous-style voice cloning in practical conversational scenarios.
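The CFG blending step mentioned above can be illustrated with a minimal sketch. It assumes the widely used formulation, guided = uncond + scale * (cond - uncond), applied to next-token logits; the summary does not state the exact formula or guidance scale used by the system, so the function names, the scale value, and the toy logits below are all hypothetical.

```python
import math

def cfg_logits(cond, uncond, scale):
    """Blend conditional and unconditional logits.

    Assumed (common) CFG form, not confirmed by the source:
        guided = uncond + scale * (cond - uncond)
    scale = 0 gives the unconditional logits, scale = 1 the conditional ones,
    and scale > 1 pushes the prediction further toward the condition.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits over a 3-token vocabulary (hypothetical values).
cond = [2.0, 1.0, 0.0]    # model run with text and prompt conditioning
uncond = [1.0, 1.0, 1.0]  # model run with the conditioning dropped

guided = cfg_logits(cond, uncond, scale=2.0)
probs = softmax(guided)
```

With a scale above 1, the token favored by the conditional pass receives more probability mass than it would from the conditional logits alone, which is the intuition behind using CFG to strengthen adherence to the input text.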
Abstract
This paper describes our zero-shot spontaneous style TTS system for the ISCSLP 2024 Conversational Voice Clone Challenge (CoVoC). We propose a LLaMA-based codec language model with a delay pattern to achieve spontaneous style voice cloning. To improve speech intelligibility, we introduce the Classifier-Free Guidance (CFG) strategy in the language model to strengthen conditional guidance on token prediction. To generate high-quality utterances, we apply effective data preprocessing and fine-tune our model on selected high-quality spontaneous speech data. The official evaluations in the CoVoC constrained track show that our system achieves the best speech naturalness MOS of 3.80 and obtains competitive speech quality and speaker similarity results.
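As an illustration of the delay pattern, the sketch below assumes the commonly used form for multi-codebook codec tokens, in which codebook k is shifted right by k steps so that a single autoregressive step predicts one token per codebook. The abstract does not specify the exact layout used, so the one-step-per-codebook shift, the PAD value, and the function names are assumptions.

```python
PAD = -1  # hypothetical padding id for positions with no token yet

def apply_delay_pattern(codes, pad=PAD):
    """Shift codebook k right by k steps.

    codes: list of K rows, each with T token ids (one row per codebook).
    Returns K rows of length T + K - 1, padded at both ends.
    """
    num_codebooks = len(codes)
    return [
        [pad] * k + list(row) + [pad] * (num_codebooks - 1 - k)
        for k, row in enumerate(codes)
    ]

def undo_delay_pattern(delayed):
    """Invert apply_delay_pattern and recover the aligned K x T codes."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - (num_codebooks - 1)
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]

# Toy example: 3 codebooks, 3 frames (hypothetical token ids).
codes = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
delayed = apply_delay_pattern(codes)
restored = undo_delay_pattern(delayed)
```

Under this layout, the column generated at step t holds frame t of the first codebook, frame t-1 of the second, and so on, so finer codebooks for a frame are predicted after the coarser ones for that same frame.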
