DINO-VITS: Data-Efficient Zero-Shot TTS with Self-Supervised Speaker Verification Loss for Noise Robustness

Vikentii Pankov, Valeria Pronina, Alexander Kuzmin, Maksim Borisov, Nikita Usoltsev, Xingshan Zeng, Alexander Golubkov, Nikolai Ermolenko, Aleksandra Shirshova, Yulia Matveeva

TL;DR

This work tackles zero-shot TTS robustness to real-world noise by jointly training a speaker encoder with a self-supervised DINO loss alongside speech synthesis objectives, enabling richer style capture and noise resilience without external denoisers. The proposed DINO-VITS architecture combines HuBERT-based content units (S2U), an mBART-based unit mapper (T2U), a pretrained speaker encoder (CAM++), and a VITS-based unit-to-speech generator, with a two-stage training regimen to avoid forgetting. Empirical results show significant gains in naturalness and speaker similarity in both clean and noisy conditions, and a two-stage, noise-aware training regime improves performance when using noisy unlabeled data, outperforming ASR-transcription-based baselines. The approach also demonstrates that HuBERT features inherently encode noise differences, enabling effective semi-supervised learning for noise-robust TTS. Overall, the method offers a practical path to high-quality, noise-robust voice cloning using data-efficient, self-supervised training.

Abstract

We address the noise-robustness problem of zero-shot TTS systems by proposing dual-objective training for the speaker encoder using a self-supervised DINO loss. This approach enhances the speaker encoder with the speech synthesis objective, capturing a wider range of speech characteristics beneficial for voice cloning. At the same time, the DINO objective improves speaker representation learning, ensuring robustness to noise and speaker discriminability. Experiments demonstrate significant improvements in subjective metrics under both clean and noisy conditions, outperforming traditional speaker-encoder-based TTS systems. Additionally, we explore training zero-shot TTS on noisy, unlabeled data. Our two-stage training strategy, leveraging self-supervised speech models to distinguish between noisy and clean speech, shows notable advances in similarity and naturalness, especially with noisy training datasets, compared to the ASR-transcription-based approach.

Paper Structure

This paper contains 16 sections, 1 equation, 1 figure, 4 tables, 1 algorithm.

Figures (1)

  • Figure 1: Architecture of the proposed method (DINO-VITS). The CAM++ Teacher is an exponential moving average (EMA) of the CAM++ weights. The Center operation subtracts an EMA of previous speaker embeddings from the teacher output.
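
The two EMA operations named in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the momentum values (0.99 and 0.9), and the temperatures are hypothetical, and the loss follows the general DINO recipe (cross-entropy between a centered, sharpened teacher distribution and the student distribution), which this excerpt does not spell out.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.99):
    """Teacher weights as an EMA of the student (speaker encoder) weights.
    The momentum value is an assumption for illustration."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

def update_center(center, teacher_out, momentum=0.9):
    """Center as an EMA of the batch-mean teacher output (assumed momentum)."""
    return momentum * center + (1.0 - momentum) * teacher_out.mean(axis=0)

def log_softmax(x, temperature):
    """Numerically stable log-softmax over the last axis."""
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the centered teacher and the student distributions.
    Temperatures are assumed defaults, not values taken from the paper."""
    targets = np.exp(log_softmax(teacher_out - center, t_teacher))
    log_probs = log_softmax(student_out, t_student)
    return float(-(targets * log_probs).sum(axis=-1).mean())
```

In a training loop under these assumptions, the student would be updated by gradient descent on the loss, after which `ema_update` refreshes the teacher and `update_center` refreshes the center using the current batch of teacher outputs.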