DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance

Kang Yin, Chunyu Qiang, Sirui Zhao, Xiaopeng Wang, Yuzhe Liang, Pengfei Cai, Tong Xu, Chen Zhang, Enhong Chen

TL;DR

DMP-TTS tackles the entanglement of speaker timbre and speaking style in controllable TTS by introducing a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. The key contributions are Style-CLAP, a unified multi-modal style encoder trained with contrastive and multi-task objectives; chained classifier-free guidance (cCFG) for independent, continuous control over content, timbre, and style; and Representation Alignment (REPA) to stabilize training by distilling acoustic-semantic priors from Whisper. Empirical results on a large internal Chinese corpus show superior style controllability with both text and audio prompts, competitive intelligibility, and strong naturalness, with ablations verifying the effectiveness of each component. These findings suggest scalable pathways for more flexible and robust controllable TTS in multi-modal settings.

Abstract

Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control during inference, we introduce chained classifier-free guidance (cCFG) trained with hierarchical condition dropout, enabling independent adjustment of content, timbre, and style guidance strengths. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations, stabilizing training and accelerating convergence. Experiments show that DMP-TTS delivers stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness. Code and demos will be available at https://y61329697.github.io/DMP-TTS/.
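
To make the idea of chained guidance concrete, the following is a minimal sketch, not the authors' implementation. It assumes a nested condition order (content, then timbre, then style) and a telescoping combination of four model predictions, which is one common way to chain classifier-free guidance; the abstract does not give the exact formula. The names `hierarchical_dropout`, `chained_cfg`, the `Denoiser` signature, and the dropout probabilities are all hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]
# Hypothetical denoiser signature: (content, timbre, style) -> prediction.
# None stands for a dropped (null) condition.
Denoiser = Callable[[Optional[str], Optional[str], Optional[str]], Vector]


def hierarchical_dropout(content, timbre, style,
                         p_content=0.1, p_timbre=0.1, p_style=0.1,
                         rng=random):
    """Drop conditions in nested order (assumed scheme).

    The model only ever sees the four subsets needed at inference:
    {}, {content}, {content, timbre}, {content, timbre, style}.
    """
    u = rng.random()
    if u < p_content:
        return None, None, None          # fully unconditional
    if u < p_content + p_timbre:
        return content, None, None       # content only
    if u < p_content + p_timbre + p_style:
        return content, timbre, None     # content + timbre
    return content, timbre, style        # all conditions kept


def chained_cfg(model: Denoiser, content, timbre, style,
                w_content: float, w_timbre: float, w_style: float) -> List[float]:
    """Combine four predictions with one guidance weight per condition.

    Each weight scales the change caused by adding one more condition,
    so the three strengths can be adjusted independently.
    """
    e_null = model(None, None, None)
    e_c = model(content, None, None)
    e_ct = model(content, timbre, None)
    e_cts = model(content, timbre, style)
    return [
        n + w_content * (c - n) + w_timbre * (ct - c) + w_style * (cts - ct)
        for n, c, ct, cts in zip(e_null, e_c, e_ct, e_cts)
    ]
```

With all three weights set to 1, the sum telescopes to the fully conditioned prediction; raising a single weight above 1 amplifies only the contribution of that condition, at the cost of four forward passes per sampling step under this assumed formulation.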

Paper Structure

This paper contains 16 sections, 7 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: (a) Overall architecture of DMP-TTS. (b) Unified multi-modal style encoder.
  • Figure 2: Effect of CFG strength on (a) speaker similarity and (b) emotion accuracy.