DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec

Tao Li, Wenshuo Ge, Zhichao Wang, Zihao Cui, Yong Ma, Yingying Gao, Chao Deng, Shilei Zhang, Junlan Feng

TL;DR

DisCo-Speech tackles the entanglement of timbre and prosody in codec-based TTS by introducing a disentangled speech codec (DisCodec) and a transformer LM. It uses a two-stage training scheme to separate content, prosody, and timbre, and then fuses content and prosody into LM-predictable tokens while injecting timbre at synthesis. The approach enables zero-shot controllable speech generation, achieving competitive voice cloning and superior zero-shot prosody control compared to baselines. The work provides a codec-level solution that harmonizes with standard LMs for flexible, controllable synthesis and outlines directions to improve disentanglement, reconstruction, and data diversity.

Abstract

Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, comprises two stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis. Audio samples are available at https://github.com/disco-speech/DisCo-Speech, and the code and weights will be released at the same link.
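
The inference flow described in the abstract (the LM continues content-prosody tokens from a style prompt, and the decoder injects the target timbre) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `ToyDisCodec`, `toy_lm_continue`, `synthesize`, and all of their parameters are hypothetical names, and the arithmetic inside them is a deterministic placeholder, not the authors' model or released API. The sketch only shows that prosody and timbre enter through two separate prompts.

```python
from dataclasses import dataclass
from typing import List, Sequence

Tokens = List[int]
Vector = List[float]


@dataclass
class ToyDisCodec:
    """Hypothetical stand-in for DisCodec; placeholder arithmetic only."""
    frame_size: int = 4
    codebook_size: int = 16

    def encode_content_prosody(self, wav: Sequence[float]) -> Tokens:
        # One fused content-prosody token per frame (toy quantizer).
        frames = [wav[i:i + self.frame_size]
                  for i in range(0, len(wav), self.frame_size)]
        return [int(sum(abs(x) for x in f) * 1000) % self.codebook_size
                for f in frames]

    def encode_timbre(self, wav: Sequence[float]) -> Vector:
        # One utterance-level timbre vector (toy: mean and peak amplitude).
        mean = sum(wav) / len(wav)
        peak = max(abs(x) for x in wav)
        return [mean, peak]

    def decode(self, tokens: Tokens, timbre: Vector) -> List[float]:
        # Timbre is injected here, at synthesis time, not in the LM.
        gain = timbre[1]
        out: List[float] = []
        for t in tokens:
            out.extend([gain * t / self.codebook_size] * self.frame_size)
        return out


def toy_lm_continue(text: str, prompt_tokens: Tokens, n_new: int,
                    codebook_size: int) -> Tokens:
    # Placeholder for autoregressive continuation conditioned on the
    # text and the style prompt's content-prosody tokens.
    tokens = list(prompt_tokens)
    for i in range(n_new):
        prev = tokens[-1] if tokens else 0
        tokens.append((prev + len(text) + i) % codebook_size)
    return tokens[len(prompt_tokens):]


def synthesize(codec: ToyDisCodec, text: str, style_wav: Sequence[float],
               timbre_wav: Sequence[float], n_new: int) -> List[float]:
    style_tokens = codec.encode_content_prosody(style_wav)   # prosody source
    new_tokens = toy_lm_continue(text, style_tokens, n_new,
                                 codec.codebook_size)
    timbre = codec.encode_timbre(timbre_wav)                  # voice source
    return codec.decode(new_tokens, timbre)
```

In this toy setup, calling `synthesize` twice with the same style prompt but different timbre prompts yields the same token sequence decoded with different timbre vectors, which mirrors the separation of prosody control and voice cloning that the abstract describes.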

Paper Structure

This paper contains 18 sections, 4 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: The overview of DisCo-Speech.
  • Figure 2: The structure and two-stage training of DisCodec.
  • Figure 3: Disentanglement visualization.