USAT: A Universal Speaker-Adaptive Text-to-Speech Approach

Wenbin Wang, Yang Song, Sanjay Jha

TL;DR

USAT presents a universal speaker-adaptive TTS framework capable of both zero-shot and few-shot adaptation through instant and fine-grained modes. It introduces a memory-augmented VAE, a timbre converter with discriminators, and lightweight adapters to balance generalization, storage, and forgetting. The ESLTTS dataset enables robust evaluation on heavily accented non-native English speech, and experiments show USAT outperforms existing zero-shot and few-shot baselines across naturalness and speaker similarity metrics. The work offers practical benefits for real-world deployment and sets a foundation for future enhancements in zero-shot generalization and efficient adapter-based adaptation.

Abstract

Conventional text-to-speech (TTS) research has predominantly focused on enhancing the quality of synthesized speech for speakers in the training dataset. Synthesizing lifelike speech for unseen, out-of-dataset speakers, especially those with limited reference data, remains a significant and unresolved problem. Zero-shot and few-shot speaker-adaptive TTS approaches have been explored, but each has limitations. Zero-shot approaches tend to generalize too poorly to reproduce the voices of speakers with heavy accents. Few-shot methods can reproduce highly varying accents, but they bring a significant storage burden and the risk of overfitting and catastrophic forgetting. In addition, prior approaches provide either zero-shot or few-shot adaptation but not both, constraining their utility across real-world scenarios with different demands. Moreover, most current evaluations of speaker-adaptive TTS are conducted only on datasets of native speakers, neglecting the large population of non-native speakers with diverse accents. Our proposed framework unifies zero-shot and few-shot speaker adaptation strategies, which we term "instant" and "fine-grained" adaptation, respectively, based on their merits. To alleviate the insufficient generalization observed in zero-shot speaker adaptation, we designed two innovative discriminators and introduced a memory mechanism for the speech decoder. To prevent catastrophic forgetting and reduce the storage cost of few-shot speaker adaptation, we designed two adapters and a unique adaptation procedure.

Paper Structure (43 sections, 12 equations, 13 figures, 7 tables)

This paper contains 43 sections, 12 equations, 13 figures, and 7 tables.

Figures (13)

  • Figure 1: The training and inference procedures of USAT's instant and fine-grained adaptation. For clarity in the illustration, the diagram omits some data flows from the timbre converter to the duration predictor.
  • Figure 2: The architecture of the memory-augmented variational autoencoder.
  • Figure 3: The architecture of the timbre converter.
  • Figure 4: The flowchart of the phoneme encoder and duration predictor.
  • Figure 5: The architecture of the flow adapter. LN denotes layer normalization, Act. represents the ReLU activation function, and Down and Up indicate the down-projection and up-projection modules. Both projection modules can be either linear or convolution layers, depending on the type of adapter.
  • ...and 8 more figures
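
The Figure 5 caption names four adapter components: layer normalization, a down-projection, a ReLU activation, and an up-projection. The sketch below shows one way such a bottleneck adapter could be wired, using only the linear variant. It is a minimal illustration, not the authors' implementation: the ordering LN, Down, ReLU, Up, the residual connection, the zero-initialized up-projection, the class name `BottleneckAdapter`, and the sizes 192 and 16 are all assumptions that do not come from the source.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class BottleneckAdapter:
    """Hypothetical linear adapter: LN -> Down -> ReLU -> Up, plus an assumed residual."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection: dim -> bottleneck (small random initialization).
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        self.b_down = np.zeros(bottleneck)
        # Assumption: the up-projection starts at zero, so the adapter is
        # initially an identity mapping and cannot disturb the base model.
        self.w_up = np.zeros((bottleneck, dim))
        self.b_up = np.zeros(dim)

    def __call__(self, x):
        h = layer_norm(x)                   # LN
        h = h @ self.w_down + self.b_down   # Down
        h = np.maximum(h, 0.0)              # Act. (ReLU)
        h = h @ self.w_up + self.b_up       # Up
        return x + h                        # residual connection (assumed)

    def num_params(self):
        """Count the parameters that would be stored per adapted speaker."""
        return sum(p.size for p in (self.w_down, self.b_down, self.w_up, self.b_up))


# Illustrative sizes only: 50 frames of 192-dimensional hidden features.
adapter = BottleneckAdapter(dim=192, bottleneck=16)
frames = np.random.default_rng(1).normal(size=(50, 192))
out = adapter(frames)
```

With these illustrative sizes the adapter holds 6,352 parameters, compared with 36,864 weights for a single full 192-by-192 linear layer. This is the kind of saving that makes storing per-speaker adapters cheaper than storing a fine-tuned copy of the model.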