Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters

Kenichi Fujita, Hiroshi Sato, Takanori Ashihara, Hiroki Kanagawa, Marc Delcroix, Takafumi Moriya, Yusuke Ijima

TL;DR

This work addresses the quality degradation that zero-shot TTS based on SSL speaker embeddings suffers when the reference speech is noisy. It inserts adapters into the SSL embedding extractor and fine-tunes them, complemented by a speech enhancement (SE) front-end, to obtain robust speaker representations under noisy conditions. With pre-training on a large multi-speaker Japanese corpus and fine-tuning on noisy data, the approach improves objective metrics as well as subjective naturalness and speaker similarity under noise. The combination of BN and CNN adapters with SE shows practical potential for high-quality, personalized TTS from imperfect reference speech.

Abstract

The zero-shot text-to-speech (TTS) method, based on speaker embeddings extracted from reference speech using self-supervised learning (SSL) speech representations, can reproduce speaker characteristics very accurately. However, this approach suffers from degradation in speech synthesis quality when the reference speech contains noise. In this paper, we propose a noise-robust zero-shot TTS method. We incorporated adapters into the SSL model, which we fine-tuned with the TTS model using noisy reference speech. In addition, to further improve performance, we adopted a speech enhancement (SE) front-end. With these improvements, our proposed SSL-based zero-shot TTS achieved high-quality speech synthesis with noisy reference speech. Through objective and subjective evaluations, we confirmed that the proposed method is highly robust to noise in reference speech and works effectively in combination with SE.
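
The adapter idea in the abstract can be illustrated with a minimal sketch of a residual bottleneck adapter acting on a single feature vector. This is not the authors' implementation: the class name `BottleneckAdapter`, the ReLU nonlinearity, the layer sizes, and the zero-initialized up-projection are all assumptions chosen for illustration, and the sketch assumes that "BN" in the TL;DR refers to a bottleneck-style adapter. It only shows why such a module can be added to a pre-trained model without changing its output before fine-tuning begins.

```python
import math
import random


def matvec(weight, x):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


class BottleneckAdapter:
    """Hypothetical residual adapter: down-project, ReLU, up-project, add input."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(dim)
        # Down-projection (bottleneck x dim), small random initialization.
        self.w_down = [
            [rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(bottleneck)
        ]
        # Up-projection (dim x bottleneck), zero-initialized (an assumption):
        # the adapter is then an identity mapping until it is trained.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_down, x)]  # ReLU
        update = matvec(self.w_up, hidden)
        # Residual connection keeps the frozen model's features intact.
        return [xi + ui for xi, ui in zip(x, update)]


adapter = BottleneckAdapter(dim=4, bottleneck=2)
features = [1.0, -2.0, 0.5, 3.0]
adapted = adapter(features)
```

In a setup like this, only `w_down` and `w_up` would be updated during fine-tuning on noisy reference speech, while the surrounding SSL layers could stay fixed; whether the paper freezes the SSL weights is not stated in this section.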

Paper Structure

This paper contains 12 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of SSL-based TTS model. Non-autoregressive TTS model is conditioned with latent representations from SSL model. LR denotes length regulator. Duration predictor and other components are conditioned on speaker embeddings separately.
  • Figure 2: SSL model.
  • Figure 3: Adapter for Transformer.
  • Figure 4: Adapter for CNN Block.
  • Figure 6: Weights that aggregate representations from WavLM for embedding of acoustic features.
  • ...and 1 more figure
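
The caption of Figure 6 refers to weights that aggregate representations from WavLM. One common way to do this is a learnable weighted sum over the layer outputs, sketched below in plain Python. The function names `softmax` and `aggregate_layers`, the softmax normalization of the weights, and the toy shapes are assumptions for illustration; the section does not say how the paper normalizes or learns these weights.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_outputs, layer_logits):
    """Weighted sum over layers.

    layer_outputs: list of L layers, each a list of T frames,
                   each frame a list of D floats.
    layer_logits:  list of L unnormalized (learnable) scalars.
    Returns a list of T frames, each a list of D floats.
    """
    weights = softmax(layer_logits)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    return [
        [
            sum(w * layer[t][d] for w, layer in zip(weights, layer_outputs))
            for d in range(dim)
        ]
        for t in range(num_frames)
    ]


# Toy example: 2 layers, 1 frame, 2 feature dimensions.
layers = [[[1.0, 2.0]], [[3.0, 4.0]]]
uniform = aggregate_layers(layers, [0.0, 0.0])
```

With equal logits the result is the plain average of the layers; after training, the learned weights indicate which layers contribute most to the embedding, which is the kind of quantity a plot such as Figure 6 would display.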