IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

Wei Deng, Siyi Zhou, Jingchen Shu, Jinchao Wang, Lu Wang

TL;DR

IndexTTS targets industrial-scale, controllable zero-shot TTS by combining an LLM-based text-to-codec framework with a Conformer-based speech conditioning encoder and a BigVGAN2 decoder. It introduces a character-pinyin hybrid approach for Chinese pronunciation control, compares VQ and FSQ in terms of codebook utilization, and learns end-to-end from raw text with minimal preprocessing. Key contributions include end-to-end pronunciation correction, robust speaker conditioning via a multi-reference Perceiver, and direct waveform generation for efficient inference. The reported results surpass XTTS and open-source systems such as Fish-Speech, CosyVoice2, FireRedTTS, and F5-TTS. The work is aimed at scalable, controllable TTS for real-world content creation, with publicly available demos and an evaluation covering accuracy, timbre, and latency.

Abstract

Recently, large language model (LLM) based text-to-speech (TTS) systems have gradually become mainstream in industry due to their high naturalness and powerful zero-shot voice cloning capabilities. Here, we introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise models, and add several novel improvements. Specifically, in Chinese scenarios, we adopt a hybrid modeling method that combines characters and pinyin, making the pronunciations of polyphonic characters and long-tail characters controllable. We also perform a comparative analysis of Vector Quantization (VQ) and Finite-Scalar Quantization (FSQ) in terms of codebook utilization for acoustic speech tokens. To further enhance the quality and stability of voice cloning, we introduce a Conformer-based speech conditional encoder and replace the speech code decoder with BigVGAN2. Compared with XTTS, IndexTTS achieves significant improvements in naturalness, content consistency, and zero-shot voice cloning. Compared with popular open-source TTS systems such as Fish-Speech, CosyVoice2, FireRedTTS, and F5-TTS, IndexTTS has a relatively simple training process, more controllable usage, and faster inference speed. Moreover, its performance surpasses that of these systems. Our demos are available at https://index-tts.github.io.
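
The character-pinyin hybrid input mentioned in the abstract can be illustrated with a minimal sketch. It assumes that a caller marks individual characters whose pronunciation should be forced and that pinyin is written as an uppercase syllable plus a tone digit. The function name `to_hybrid_tokens`, the override format, and the token spelling are hypothetical and are not taken from the IndexTTS code.

```python
from typing import Dict, List


def to_hybrid_tokens(text: str, pinyin_overrides: Dict[int, str]) -> List[str]:
    """Return one token per character, using pinyin where an override exists.

    pinyin_overrides maps a character position to a pinyin token such as
    "HANG2" (hypothetical format: uppercase syllable followed by tone digit).
    """
    tokens: List[str] = []
    for index, char in enumerate(text):
        if index in pinyin_overrides:
            # Pronunciation is pinned explicitly for this character.
            tokens.append(pinyin_overrides[index])
        else:
            # Otherwise the raw character is kept and the model decides.
            tokens.append(char)
    return tokens


# "行" is polyphonic (e.g. hang2 vs. xing2); pin the first occurrence only.
hybrid = to_hybrid_tokens("银行行长", {1: "HANG2"})
print(hybrid)
```

In this sketch, the unmarked characters stay as plain characters, so a mixed sequence of characters and pinyin tokens is what would be passed on to the text tokenizer.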

Paper Structure

This paper contains 18 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: An overview of IndexTTS. A text-to-speech language model conditioned on prompt speech and text tokens generates acoustic tokens, and the BigVGAN2 decoder converts the LLM output latents into a waveform.
  • Figure 2: Comparison of the distribution of codebook utilization rates of VQ and FSQ under different training data scales.
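
To make the quantity in Figure 2 concrete, the following is a rough sketch of how codebook utilization could be measured, together with a simplified form of finite scalar quantization. It assumes that utilization means the fraction of codebook entries used at least once, and it restricts FSQ to odd level counts for simplicity. The level counts and function names are illustrative assumptions, not the configuration used in the paper.

```python
import math
from typing import Iterable, List, Sequence


def fsq_quantize(latent: Sequence[float], levels: Sequence[int]) -> List[int]:
    """Map each latent dimension to an integer level in [0, L - 1].

    Simplified FSQ: bound each value with tanh, then round to the nearest
    level. Only odd level counts are handled in this sketch.
    """
    codes: List[int] = []
    for value, num_levels in zip(latent, levels):
        half = (num_levels - 1) / 2
        bounded = math.tanh(value) * half  # lies in (-half, half)
        codes.append(int(round(bounded + half)))
    return codes


def fsq_index(codes: Sequence[int], levels: Sequence[int]) -> int:
    """Combine per-dimension codes into one token id (mixed-radix number)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        index = index * num_levels + code
    return index


def codebook_utilization(token_ids: Iterable[int], codebook_size: int) -> float:
    """Fraction of codebook entries that occur at least once in token_ids."""
    used = {t for t in token_ids if 0 <= t < codebook_size}
    return len(used) / codebook_size


levels = [3, 5]  # hypothetical; implicit codebook size is 3 * 5 = 15
latents = [[0.0, 0.0], [10.0, -10.0], [0.0, 0.0]]
ids = [fsq_index(fsq_quantize(z, levels), levels) for z in latents]
print(ids, codebook_utilization(ids, 15))
```

With FSQ the codebook is implicit (the product of the level counts), whereas a VQ codebook is a learned table. The same `codebook_utilization` function applies to either once token ids are available.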