Multi-modal Adversarial Training for Zero-Shot Voice Cloning

John Janiczek; Dading Chong; Dongyang Dai; Arlo Faria; Chao Wang; Tao Wang; Yuzong Liu

TL;DR

The paper tackles the challenge of oversmoothing in TTS and the difficulty of zero-shot voice cloning by introducing a GAN-based framework that employs a Transformer encoder-decoder as a Multi-modal Fusion Discriminator. It jointly trains a FastSpeech2-based acoustic model with discriminators that operate on both acoustic and prosodic features, using contextual information from text and speaker identity to guide generation. Through a Multi-feature Generative Adversarial Training approach, the model achieves higher quality (NISQA MOS) and speaker similarity, and exhibits more expressive prosody as measured by pitch variation, particularly on unseen speakers from Libriheavy and LibriTTS-R. The method maintains real-time CPU-friendly inference, demonstrating practical applicability in resource-constrained settings for high-quality zero-shot voice cloning.

Abstract

A text-to-speech (TTS) model trained to reconstruct speech given text tends towards predictions that are close to the average characteristics of a dataset, failing to model the variations that make human speech sound natural. This problem is magnified for zero-shot voice cloning, a task that requires training data with high variance in speaking styles. We build on recent work that uses Generative Adversarial Networks (GANs) by proposing a Transformer encoder-decoder architecture that conditionally discriminates between real and generated speech features. The discriminator is used in a training pipeline that improves both the acoustic and prosodic features of a TTS model. We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset, for the task of zero-shot voice cloning. Our model achieves improvements over the baseline in terms of speech quality and speaker similarity. Audio examples from our system are available online.
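
For orientation, conditional discrimination between real and generated speech features can be written in the standard conditional GAN form. This is a sketch under assumptions, not the loss used in the paper, which the abstract does not specify. All symbols are introduced here for illustration: $G$ stands for the acoustic model, $D$ for the discriminator, $x$ for real speech features (acoustic or prosodic), and $c$ for the conditioning context (text and speaker identity).

```latex
\min_G \max_D \;
\mathbb{E}_{(x, c)}\big[\log D(x \mid c)\big]
+ \mathbb{E}_{c}\big[\log\big(1 - D(G(c) \mid c)\big)\big]
```

Under this formulation the discriminator sees the same context as the generator, so it can penalize features that are plausible in general but implausible for the given text and speaker, such as the over-averaged predictions the abstract describes.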
