Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement

Qianniu Chen, Xiaoyang Hao, Bowen Li, Yue Liu, Li Lu

TL;DR

The paper tackles the high resource costs and privacy concerns of zero-shot TTS by proposing a lightweight architecture that models linguistic content from source speech and multi-level speaker attributes from prompt speech. A two-stage self-distillation framework generates parallel data to disentangle content and speaker at the data level, guiding a student TTS to rely on prompt speech for speaker conditioning while preserving content. Empirical results show superior content integrity (CER and MOS-con) and strong speaker similarity, with far fewer parameters ($\approx$22.5M) and lower real-time factors ($\text{RTF}_{\text{CPU}}=0.13$, $\text{RTF}_{\text{GPU}}=0.012$) than the baselines. The approach enables efficient zero-shot synthesis suitable for on-device deployment and privacy-preserving applications without sacrificing quality.

Abstract

Zero-shot Text-To-Speech (TTS) synthesis shows great promise for personalized voice customization through voice cloning. However, current methods for achieving zero-shot TTS heavily rely on large model scales and extensive training datasets to ensure satisfactory performance and generalizability across various speakers. This raises concerns regarding both deployment costs and data security. In this paper, we present a lightweight and stable zero-shot TTS system. We introduce a novel TTS architecture designed to effectively model linguistic content and various speaker attributes from source speech and prompt speech, respectively. Furthermore, we present a two-stage self-distillation framework that constructs parallel data pairs for effectively disentangling linguistic content and speakers from the perspective of training data. Extensive experiments show that our system exhibits excellent performance and superior stability on the zero-shot TTS tasks. Moreover, it shows markedly superior computational efficiency, with RTFs of 0.13 and 0.012 on the CPU and GPU, respectively.
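To make the reported RTFs concrete: the real-time factor is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The Python sketch below only illustrates that arithmetic. The function names are hypothetical, and the measurement protocol behind the reported 0.13 and 0.012 (hardware, batch size, which components are timed) is not given in this excerpt.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return RTF = synthesis time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_seconds_for(rtf: float, audio_seconds: float) -> float:
    """Return the expected synthesis time for a clip at a given RTF."""
    if rtf < 0 or audio_seconds < 0:
        raise ValueError("rtf and audio_seconds must be non-negative")
    return rtf * audio_seconds


if __name__ == "__main__":
    clip_seconds = 10.0  # illustrative clip length, not taken from the paper
    for device, rtf in (("CPU", 0.13), ("GPU", 0.012)):
        cost = synthesis_seconds_for(rtf, clip_seconds)
        print(f"{device}: {clip_seconds:.0f} s of speech in about {cost:.2f} s")
```

At the reported RTFs, a 10-second utterance would take roughly 1.3 s to synthesize on the CPU and roughly 0.12 s on the GPU.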

Paper Structure (12 sections, 1 equation, 4 figures, 1 table)

Figures (4)

  • Figure 1: Architecture of the proposed TTS model. The red dashed line indicates participation only during training. The Timbre encoder is a pre-trained model requiring no optimization.
  • Figure 2: Two-stage training framework.
  • Figure 3: t-SNE visualization of speaker embeddings from our system with and without self-distillation. Triangles represent synthetic samples, circles represent real samples.
  • Figure 4: Impact of the self-distillation coefficient $\sigma$ on SIM and CER.