UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information

Rui Wang, Qianguo Sun, Tianrong Chen, Zhiyun Zeng, Junlong Wu, Jiaxing Zhang

TL;DR

This work addresses the limitations of separating semantic and acoustic information in LLM-based TTS by introducing DistilCodec, a distillation framework that compresses a multi-codebook neural audio codec into a single codebook of 32,768 codes with near-full utilization. Built on DistilCodec, UniTTS performs end-to-end audio–text modeling by combining three autoregressive tasks with a three-stage training pipeline (Pre-Training, SFT, Alignment), preserving the LLM's text capabilities while extending audio modeling to universal audio data. The authors report improved emotional expressiveness and fidelity in speech synthesis, supported by evaluations against state-of-the-art TTS systems and by ablation studies on text instructions, prompt design, and alignment techniques. Overall, DistilCodec and UniTTS offer a scalable, data-diverse path to more natural and expressive TTS by unifying audio perception and cognition within an LLM-centric framework.

Abstract

The emergence of multi-codebook neural audio codecs, such as those based on Residual Vector Quantization (RVQ) and Group Vector Quantization (GVQ), has significantly advanced Large-Language-Model (LLM) based Text-to-Speech (TTS) systems. These codecs are crucial in separating semantic and acoustic information while efficiently harnessing semantic priors. However, since semantic and acoustic information cannot be fully aligned, a significant drawback of these methods when applied to LLM-based TTS is that large language models may have limited access to comprehensive audio information. To address this limitation, we propose DistilCodec and UniTTS, which collectively offer the following advantages: 1) DistilCodec can distill a multi-codebook audio codec into a single-codebook audio codec with 32,768 codes while achieving near 100% utilization. 2) As DistilCodec does not employ a semantic alignment scheme, a large amount of high-quality unlabeled audio (such as audiobooks with sound effects, songs, etc.) can be incorporated during training, further expanding data diversity and broadening its applicability. 3) Leveraging the comprehensive audio information modeling of DistilCodec, we integrate three key tasks into UniTTS's pre-training framework: audio modality autoregression, text modality autoregression, and speech-text cross-modal autoregression. This allows UniTTS to accept interleaved text and speech/audio prompts while substantially preserving the LLM's text capabilities. 4) UniTTS employs a three-stage training process: Pre-Training, Supervised Fine-Tuning (SFT), and Alignment. Source code and model checkpoints are publicly available at https://github.com/IDEA-Emdoor-Lab/UniTTS and https://github.com/IDEA-Emdoor-Lab/DistilCodec.
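To make the "near 100% utilization" claim concrete, the sketch below shows one common way codebook utilization can be measured: the fraction of the 32,768 codes that appear at least once in a stream of codec token ids. This is an illustrative assumption about the metric, not the authors' evaluation code; the function name `codebook_utilization` and the randomly generated token stream are hypothetical stand-ins for real codec output.

```python
import random

CODEBOOK_SIZE = 32_768  # single-codebook size stated in the abstract


def codebook_utilization(token_ids, codebook_size=CODEBOOK_SIZE):
    """Return the fraction of codebook entries seen at least once.

    Hypothetical helper: assumes utilization means "distinct codes used
    divided by codebook size", which may differ from the paper's metric.
    """
    used = set()
    for token in token_ids:
        if not 0 <= token < codebook_size:
            raise ValueError(f"token id {token} outside [0, {codebook_size})")
        used.add(token)
    return len(used) / codebook_size


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for real codec output: uniformly random code indices.
    tokens = [rng.randrange(CODEBOOK_SIZE) for _ in range(500_000)]
    print(f"utilization: {codebook_utilization(tokens):.4f}")
```

In practice the token ids would come from encoding a held-out audio set with the codec. A value well below 1.0 would indicate unused codes (codebook collapse), which is the failure mode that a large single codebook has to avoid.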

Paper Structure

This paper contains 35 sections, 11 equations, 7 figures, 22 tables, 3 algorithms.

Figures (7)

  • Figure 1: The UniTTS architecture consists of an ALM tokenizer and an ALM backbone network, supporting both text and audio inputs and outputs. Within the architecture, DistilCodec is responsible for audio signal transformation: its encode module discretizes audio into latent representations, while the decode module reconstructs the waveform for acoustic output.
  • Figure 2: The detailed network architecture of DistilCodec.
  • Figure 3: Training diagram of DistilCodec.
  • Figure 4: Training schema of UniTTS and DistilCodec. DistilCodec consists of three core components: an Encoder, GRFVQ, and a Decoder, trained on universal audio data. The training process of UniTTS follows a methodology analogous to that of Large Language Models (LLMs), comprising three stages: Pretraining, Supervised Fine-Tuning (SFT), and Alignment. Notably, the pretraining phase utilizes universal audio as part of its training data.
  • Figure 5: Inference Prompt Template
  • ...and 2 more figures