Improving Language Model-Based Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts

Shun Lei; Yixuan Zhou; Liyang Chen; Dan Luo; Zhiyong Wu; Xixin Wu; Shiyin Kang; Tao Jiang; Yahui Zhou; Yuxing Han; Helen Meng

TL;DR

The paper addresses a limitation of language model-based zero-shot TTS: a short acoustic prompt is not enough to clone a speaker's personal speaking style. It introduces a multi-scale prompting framework in which a speaker-aware text encoder captures phoneme-level speaking style from a multi-sentence style prompt, while a VALL-E-based acoustic decoder preserves timbre from a frame-level timbre prompt; the two components are trained end to end. Trained on LibriTTS with randomly sampled reference utterances, the approach outperforms the baselines in MOS and SECS, and its performance improves as the style prompt gets longer. By using longer prompts at two granularities, the model captures both the speaking style and the timbre of unseen speakers.

Abstract

Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing the speech waveform into discrete acoustic tokens and modeling these tokens with a language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt from an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone a personal speaking style. In this paper, we propose a novel zero-shot TTS model with multi-scale acoustic prompts, based on the neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme level from a style prompt consisting of multiple sentences. Following that, a VALL-E-based acoustic decoder models the timbre from the timbre prompt at the frame level and generates speech. The experimental results show that our proposed method outperforms the baselines in terms of naturalness and speaker similarity, and achieves better performance when scaled to a longer style prompt.
