HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks

Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria

TL;DR

HyperTTS is a parameter-efficient approach for adapting a multi-speaker TTS backbone to new speakers: a small hypernetwork conditions the adapters on speaker representations. The hypernetwork generates the adapter weights dynamically, which expands the effective parameter space while the backbone stays frozen, and it achieves near-fine-tuning performance while adding fewer than 1% of the backbone's parameters. On LibriTTS and VCTK, HyperTTS surpasses the static AdapterTTS baseline and approaches the quality of full fine-tuning, with gains in subjective MOS and strong speaker similarity, which suggests it is practical for scalable, domain-generic multi-speaker TTS. The work highlights dynamic, per-speaker adaptation as a promising direction, while noting training challenges and possible enhancements such as alternative normalization strategies and LoRA-based variants.

Abstract

Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While TTS architectures that train and test on the same set of speakers have improved significantly, performance on out-of-domain speakers remains severely limited. Domain adaptation to a new set of speakers can be achieved by fine-tuning the whole model for each new domain, which is parameter-inefficient. Adapters address this problem by providing a parameter-efficient alternative for domain adaptation. Although Adapters are widely used in NLP, speech synthesis has not seen much improvement from them. In this work, we present HyperTTS, which comprises a small learnable network, a "hypernetwork", that generates the parameters of the Adapter blocks, allowing us to condition Adapters on speaker representations and make them dynamic. Extensive evaluations in two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS against baselines in several studies. Promising results on the dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems. The audio samples and code are available at https://github.com/declare-lab/HyperTTS.
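
The core idea in the abstract, a hypernetwork that emits adapter parameters from a speaker representation, can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). The class name `HyperAdapter`, all dimensions, the two-head layout, the tanh/ReLU activations, and the random weights standing in for learned parameters are assumptions made for illustration. The layer embedding input follows the SE/LE conditioning named in the Figure 2 caption.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned parameters (assumption)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def reshape(flat, rows, cols):
    """Turn a flat list of rows * cols values into a list of rows."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


class HyperAdapter:
    """Hypothetical bottleneck adapter whose weights come from a hypernetwork."""

    def __init__(self, d_model, d_bottleneck, d_speaker, d_layer, d_hidden, seed=0):
        rng = random.Random(seed)
        self.d_model = d_model
        self.d_bottleneck = d_bottleneck
        # Hypernetwork parameters: the only trainable part in this sketch.
        self.w_in = rand_matrix(d_hidden, d_speaker + d_layer, rng)
        self.w_down_head = rand_matrix(d_bottleneck * d_model, d_hidden, rng)
        self.w_up_head = rand_matrix(d_model * d_bottleneck, d_hidden, rng)

    def generate(self, speaker_emb, layer_emb):
        """Map (speaker embedding, layer embedding) to adapter weight matrices."""
        hidden = [math.tanh(v) for v in matvec(self.w_in, speaker_emb + layer_emb)]
        down = reshape(matvec(self.w_down_head, hidden), self.d_bottleneck, self.d_model)
        up = reshape(matvec(self.w_up_head, hidden), self.d_model, self.d_bottleneck)
        return down, up

    def forward(self, x, speaker_emb, layer_emb):
        """Apply the generated adapter to a frozen-backbone activation x."""
        down, up = self.generate(speaker_emb, layer_emb)
        bottleneck = [max(0.0, v) for v in matvec(down, x)]  # down-project + ReLU
        delta = matvec(up, bottleneck)                       # up-project
        return [xi + di for xi, di in zip(x, delta)]         # residual connection


if __name__ == "__main__":
    adapter = HyperAdapter(d_model=4, d_bottleneck=2, d_speaker=3, d_layer=2, d_hidden=5)
    activation = [0.5, -1.0, 0.25, 2.0]
    layer = [1.0, 0.0]
    # Two hypothetical speakers yield two different adapters from one hypernetwork.
    print(adapter.forward(activation, [0.1, 0.2, 0.3], layer))
    print(adapter.forward(activation, [0.9, -0.4, 0.0], layer))
```

In a static adapter the `down` and `up` matrices would be learned directly and shared by every speaker. Here they are recomputed per speaker and per layer, which is what makes the adapter dynamic, while only the hypernetwork's own parameters would need training.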

Paper Structure (36 sections, 2 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Comparison of our approach against baselines. Fine-tuning tunes the backbone model parameters on the adaptation dataset. AdapterTTS inserts learnable modules into the backbone. HyperTTS (ours) converts the static adapter modules into dynamic ones by speaker-conditional sampling using a (learnable) hypernetwork. Both AdapterTTS and HyperTTS keep the backbone model parameters frozen and are thus parameter-efficient.
  • Figure 2: An overview of HyperTTS. SE and LE denote speaker embedding and layer embedding.
  • Figure 3: XAB speaker similarity test results between AdapterTTS and $\text{HyperTTS}_{d}$.
  • Figure 4: t-SNE of hypernetwork-generated parameters for 20 randomly sampled speakers from the VCTK test set. Marks of the same color represent reference speech from the same speaker.