CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech

Jaehyeon Kim, Keon Lee, Seungjun Chung, Jaewoong Cho

TL;DR

CLaM-TTS introduces a two-stage framework for zero-shot TTS. The first stage trains a Mel-VAE with probabilistic residual vector quantization (RVQ), which encodes speech into a compact discrete latent representation with far shorter token sequences. The second stage trains a latent language model on that representation. At each step, the language model predicts a latent z_t from a Gaussian mixture conditioned on the text and the preceding codes; quantizing z_t yields all of that step's tokens at once, which avoids cascaded generation across multiple token streams. Trained on 100K hours across 11 languages, CLaM-TTS achieves naturalness, intelligibility, and speaker similarity competitive with or better than neural codec baselines, with faster inference. It also benefits from more extensive language-model pretraining and from careful text-tokenization choices. The work further analyzes codeword rate, robustness, and prompting strategies, offering practical guidance for scalable zero-shot TTS with latent language modeling.
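To make the Gaussian-mixture step more concrete, here is a minimal sketch of drawing one latent vector from a mixture of diagonal Gaussians. It is an illustration only, not the authors' implementation: the function name `sample_gaussian_mixture`, the two-component mixture, and all numeric values are assumptions, and in the actual model the mixture parameters would be produced by the language model from the text and preceding codes rather than hard-coded.

```python
import random

def sample_gaussian_mixture(weights, means, stds, rng):
    """Draw one vector from a mixture of diagonal Gaussians.

    weights: mixture weights, one per component (hypothetical values below)
    means, stds: per-component lists of per-dimension means and std-devs
    """
    # Pick a component index according to the mixture weights.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Sample each dimension independently from the chosen component.
    z = [rng.gauss(mu, sigma) for mu, sigma in zip(means[k], stds[k])]
    return k, z

# Toy parameters (assumed for illustration only).
rng = random.Random(0)
weights = [0.7, 0.3]
means = [[0.0, 0.0], [3.0, -3.0]]
stds = [[0.5, 0.5], [0.5, 0.5]]

component, z_t = sample_gaussian_mixture(weights, means, stds, rng)
```

The sampled `z_t` would then be passed to a quantizer to obtain discrete tokens.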

Abstract

With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the ongoing rush towards scaling paradigms, audio tokenization ironically amplifies the scalability challenge, stemming from its long sequence length and the complexity of modelling the multiple sequences. To mitigate these issues, we present CLaM-TTS, which employs probabilistic residual vector quantization to (1) achieve superior compression in the token length, and (2) allow a language model to generate multiple tokens at once, thereby eliminating the need for cascaded modeling to handle the number of token streams. Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art neural codec-based TTS models regarding naturalness, intelligibility, speaker similarity, and inference speed. In addition, we examine the impact of the pretraining extent of the language models and their text tokenization strategies on performance.
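As a rough illustration of how residual vector quantization turns one latent vector into several tokens at once, the sketch below implements plain greedy RVQ with fixed codebooks. This is the standard deterministic variant, not the probabilistic RVQ proposed in the paper; the function names and the tiny hand-written codebooks are assumptions made for the example.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def rvq_encode(latent, codebooks):
    """Greedy residual quantization: one token per depth for a single latent."""
    residual = list(latent)
    tokens = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        tokens.append(idx)
        # Subtract the chosen code; the next depth quantizes what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct the latent as the sum of the selected codes."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Two depths, two codes each (hypothetical toy codebooks).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.0]],
]
tokens = rvq_encode([1.4, 1.0], codebooks)
recon = rvq_decode(tokens, codebooks)
print(tokens, recon)  # [1, 1] [1.5, 1.0]
```

A single latent vector yields one token per depth, so a model that predicts the latent directly obtains all depths in one step instead of generating each token stream in turn.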

Paper Structure (53 sections, 14 equations, 2 figures, 14 tables)

Figures (2)

  • Figure 1: An overview of CLaM-TTS. Training of CLaM-TTS unfolds in two stages: (a) we train a Mel-VAE that encodes a mel-spectrogram into a discrete latent representation using probabilistic RVQ; (b) using the pre-trained residual vector quantizer from the first stage, we train a latent language model, a Gaussian mixture (GM) based latent transformer decoder. The decoder aims to predict latent variables that, when quantized, match the ground-truth audio tokens.
  • Figure 2: Codebook usage of the probabilistic RVQ and the prior RVQ method during training. Usage is plotted for the 0th, 8th, 16th, 24th, and 31st depths, from left to right.
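
For readers unfamiliar with the quantity in Figure 2, one common way to measure codebook usage is the fraction of codebook entries selected at least once at each quantizer depth. The sketch below assumes that definition, which may differ from the one used in the paper; the helper names and the toy token data are hypothetical.

```python
def codebook_usage(token_ids, codebook_size):
    """Fraction of codebook entries that appear at least once in token_ids."""
    used = {t for t in token_ids if 0 <= t < codebook_size}
    return len(used) / codebook_size

def usage_by_depth(token_frames, codebook_size):
    """token_frames: list of frames, each a list with one token per depth."""
    depth = len(token_frames[0])
    return [
        codebook_usage([frame[d] for frame in token_frames], codebook_size)
        for d in range(depth)
    ]

# Four frames, two depths, codebook size 4 (toy data, assumed).
frames = [[0, 1], [2, 1], [3, 1], [0, 1]]
usage = usage_by_depth(frames, codebook_size=4)
print(usage)  # [0.75, 0.25]
```

In this toy example, depth 0 uses three of its four codes while depth 1 always selects the same code, the kind of under-use that a per-depth usage plot makes visible.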