Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

Kenichi Fujita, Takanori Ashihara, Marc Delcroix, Yusuke Ijima

TL;DR

This work addresses the practicality gap of zero-shot TTS by proposing a lightweight approach that leverages a mixture of adapters (MoA) gated by speaker embeddings. MoA modules are inserted into the decoder and variance predictors of a FastSpeech2-based backbone and trained with dense and sparse gating plus an auxiliary importance loss. The resulting model achieves high naturalness and speaker similarity with under 40% of the baseline's parameters and 1.9× faster inference. Objective and subjective evaluations show better performance than the baselines across diverse speakers, and weight analyses show that the model learns speaker-specific adapter activation. The method enables scalable, edge-friendly zero-shot TTS and could extend to larger architectures like VALL-E to further broaden speaker coverage while maintaining efficiency.
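The gating described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the linear gate, the top-k form of sparse gating, and the importance loss written as the squared coefficient of variation of per-adapter gate mass (the formulation commonly used for mixture-of-experts models) are all assumptions, and every name and shape here is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_gate(spk_emb, w_gate):
    """Dense gating: every adapter gets a nonzero weight.

    spk_emb: (batch, d_spk) speaker embeddings (hypothetical shape)
    w_gate:  (d_spk, n_adapters) gate projection (assumed to be linear)
    """
    return softmax(spk_emb @ w_gate)

def sparse_gate(spk_emb, w_gate, k):
    """Sparse gating: keep the k largest logits per speaker, mask the rest."""
    logits = spk_emb @ w_gate
    kth_largest = np.sort(logits, axis=-1)[:, -k][:, None]
    masked = np.where(logits >= kth_largest, logits, -np.inf)
    return softmax(masked)

def importance_loss(gates):
    """Squared coefficient of variation of the per-adapter gate mass.

    It is 0 when all adapters receive equal total weight over the batch and
    grows as the gate collapses onto a few adapters. Population variance is
    used here, which is an assumption.
    """
    importance = gates.sum(axis=0)
    return importance.var() / (importance.mean() ** 2 + 1e-10)

rng = np.random.default_rng(0)
spk_emb = rng.normal(size=(8, 16))   # 8 speakers, 16-dim embeddings (hypothetical)
w_gate = rng.normal(size=(16, 4))    # 4 adapters (hypothetical)
dense_weights = dense_gate(spk_emb, w_gate)
sparse_weights = sparse_gate(spk_emb, w_gate, k=2)
aux_loss = importance_loss(dense_weights)
```

In this sketch the auxiliary term would be added to the TTS training loss with a small weight, so that no adapter is left unused while the gate still specializes by speaker.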

Abstract

Advances in zero-shot text-to-speech (TTS) methods based on large-scale models have demonstrated high fidelity in reproducing speaker characteristics. However, these models are too large for practical daily use. We propose a lightweight zero-shot TTS method using a mixture of adapters (MoA). Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model. These modules enhance the ability to adapt to a wide variety of speakers in a zero-shot manner by selecting appropriate adapters associated with speaker characteristics on the basis of speaker embeddings. Our method achieves high-quality speech synthesis with minimal additional parameters. Through objective and subjective evaluations, we confirmed that our method achieves better performance than the baseline with less than 40% of the parameters and 1.9 times faster inference. Audio samples are available on our demo page (https://ntt-hilab-gensp.github.io/is2024lightweightTTS/).
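To make the idea of selecting adapters from a speaker embedding concrete, here is a minimal NumPy sketch of one MoA module applied to a sequence of hidden states. It assumes bottleneck adapters (down-projection, ReLU, up-projection), a softmax gate over a linear projection of the speaker embedding, and a residual connection. None of these details are confirmed by the abstract, and the class name, method names, and dimensions are hypothetical.

```python
import numpy as np

class MixtureOfAdapters:
    """Illustrative MoA module; not the authors' implementation."""

    def __init__(self, d_model, d_bottleneck, d_spk, n_adapters, seed=0):
        rng = np.random.default_rng(seed)
        # One bottleneck adapter per slot: d_model -> d_bottleneck -> d_model.
        self.w_down = rng.normal(0.0, 0.02, (n_adapters, d_model, d_bottleneck))
        self.w_up = rng.normal(0.0, 0.02, (n_adapters, d_bottleneck, d_model))
        # The gate maps a speaker embedding to one logit per adapter.
        self.w_gate = rng.normal(0.0, 0.02, (d_spk, n_adapters))

    def gate(self, spk_emb):
        logits = spk_emb @ self.w_gate
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def __call__(self, hidden, spk_emb):
        # hidden: (frames, d_model); spk_emb: (d_spk,)
        weights = self.gate(spk_emb)                               # (n,)
        z = np.einsum("td,ndb->ntb", hidden, self.w_down)          # down-project
        z = np.maximum(z, 0.0)                                     # ReLU
        out = np.einsum("ntb,nbd->ntd", z, self.w_up)              # up-project
        mixed = np.einsum("n,ntd->td", weights, out)               # gate-weighted sum
        return hidden + mixed                                      # residual (assumed)

    def n_params(self):
        # Biases are omitted in this sketch.
        return self.w_down.size + self.w_up.size + self.w_gate.size

# Hypothetical sizes chosen only for illustration.
moa = MixtureOfAdapters(d_model=256, d_bottleneck=16, d_spk=192, n_adapters=8)
hidden = np.zeros((100, 256))
spk_emb = np.ones(192)
adapted = moa(hidden, spk_emb)
extra_params = moa.n_params()
```

Because each adapter works in a narrow bottleneck, the parameters added per module scale with `n_adapters * 2 * d_model * d_bottleneck`, which stays small relative to a full feed-forward layer when the bottleneck is narrow. This is one plausible reading of "minimal additional parameters", not a figure taken from the paper.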


Paper Structure

This paper contains 12 sections, 2 equations, 17 figures, and 3 tables.

Figures (17)

  • Figure 1: Overview of proposed method
  • Figure 2: FFT with MoA
  • Figure 3: Predictor with MoA
  • Figure 4: Overview of MoA module
  • Figure 6: MCD (all)
  • ...and 12 more figures