Intelli-Z: Toward Intelligible Zero-Shot TTS

Sunghee Jung; Won Jang; Jaesam Yoon; Bongwan Kim

TL;DR

Intelli-Z targets the intelligibility gap in zero-shot TTS with three mechanisms: knowledge distillation from a pretrained multi-speaker TTS model, whose speaker embeddings serve as prototype embeddings for the student; a cycle-consistency loss that allows mismatched text-speech pairs to be used during training; and selective temporal aggregation, which uses an aperiodicity mask to pool speaker embeddings over voiced frames so that the text content of the reference speech interferes less at inference. The approach uses a meta-learning-inspired framework with support and query sets, plus adversarial training to align outputs with speaker identity. In an ablation study, MOS for unseen speakers increases by 9% when the first two methods are applied and improves by a further 16% when the aperiodicity mask is applied. Qualitative analysis of mel-spectrograms indicates that the mask reduces lisping, where /s/ sounds become /th/ or /d/, and improves disentanglement of speaker identity from reference text content. Overall, Intelli-Z offers a practical route to more intelligible zero-shot TTS for unseen speakers without requiring extensive target-speaker data.

Abstract

Although numerous recent studies have suggested new frameworks for zero-shot TTS using large-scale, real-world data, studies that focus on the intelligibility of zero-shot TTS are relatively scarce. Zero-shot TTS demands additional efforts to ensure clear pronunciation and speech quality due to its inherent requirement of replacing a core parameter (speaker embedding or acoustic prompt) with a new one at the inference stage. In this study, we propose a zero-shot TTS model focused on intelligibility, which we refer to as Intelli-Z. Intelli-Z learns speaker embeddings by using multi-speaker TTS as its teacher and is trained with a cycle-consistency loss to include mismatched text-speech pairs for training. Additionally, it selectively aggregates speaker embeddings along the temporal dimension to minimize the interference of the text content of reference speech at the inference stage. We substantiate the effectiveness of the proposed methods with an ablation study. The Mean Opinion Score (MOS) increases by 9% for unseen speakers when the first two methods are applied, and it further improves by 16% when selective temporal aggregation is applied.
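
The selective temporal aggregation described above can be illustrated with a minimal sketch. It assumes frame-level speaker embeddings and a per-frame aperiodicity value are already available; the function names, the 0.5 threshold, and the fallback to a plain mean are hypothetical choices for illustration, not details taken from the paper.

```python
from typing import List, Sequence


def aperiodicity_mask(aperiodicity: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Mark frames as 'keep' when their aperiodicity is below a threshold.

    Assumption: low aperiodicity is treated as a voiced frame. The threshold
    value is hypothetical and not taken from the paper.
    """
    return [a < threshold for a in aperiodicity]


def masked_temporal_mean(
    frames: Sequence[Sequence[float]], keep: Sequence[bool]
) -> List[float]:
    """Average frame-level embeddings over the frames selected by `keep`."""
    if not frames:
        raise ValueError("frames must not be empty")
    if len(frames) != len(keep):
        raise ValueError("frames and keep must have the same length")
    selected = [f for f, k in zip(frames, keep) if k]
    if not selected:
        # Hypothetical fallback: if every frame is masked out, use all frames.
        selected = list(frames)
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


# Three frames of a 2-dimensional embedding; the last frame is highly aperiodic.
frames = [[1.0, 0.0], [3.0, 2.0], [10.0, 10.0]]
aperiodicity = [0.1, 0.2, 0.9]
speaker_embedding = masked_temporal_mean(frames, aperiodicity_mask(aperiodicity))
print(speaker_embedding)  # [2.0, 1.0]
```

In this toy case the third frame is excluded, so the utterance-level embedding is the mean of the first two frames only. A plain temporal mean would instead mix in the high-aperiodicity frame, which is the kind of reference-content interference the masking is meant to limit.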

Paper Structure (14 sections, 10 equations, 2 figures, 2 tables)

Introduction
Meta-StyleSpeech
Support set and query set
Adversarial loss for training query set
Proposed idea
Knowledge distillation from multi-speaker TTS
Cycle-consistency loss
Selective Voice Frame Aggregation
Experiments and results
Dataset
Model details
Subjective evaluation
Qualitative evaluation of aperiodicity masks
Conclusion

Figures (2)

Figure 1: (a): Support set training scheme of Meta-StyleSpeech. Text encoder and text discriminator are omitted in the figure for simplicity. The variance adaptor and decoder blocks are inside the generator, as in FastSpeech2. (b): Query set training scheme of Meta-StyleSpeech. (c): Support set training scheme of Intelli-Z. Text encoder is omitted for simplicity. (d): Query set training scheme of Intelli-Z.
Figure 2: (a), (b): Mel-spectrograms synthesized without applying the aperiodicity mask. (c), (d): Mel-spectrograms synthesized with the application of the aperiodicity mask. The black lines observed in the red boxes in (a) and (b) result in lisping sounds, and /s/ sounds become /th/ or /d/. This problem is resolved in (c) and (d).
