
Enhancing Zero-Shot Multi-Speaker TTS with Negated Speaker Representations

Yejin Jeon, Yunsu Kim, Gary Geunbae Lee

TL;DR

A negation feature learning paradigm models decoupled speaker attributes as deviations from the complete audio representation, obtained through subtraction. This mitigates content leakage, which makes synthesis more robust, and also improves speaker fidelity.

Abstract

Zero-shot multi-speaker TTS aims to synthesize speech with the voice of a chosen target speaker without any fine-tuning. Prevailing methods, however, encounter limitations in adapting to new speakers in out-of-domain settings, primarily due to inadequate speaker disentanglement and content leakage. To overcome these constraints, we propose an innovative negation feature learning paradigm that models decoupled speaker attributes as deviations from the complete audio representation by utilizing the subtraction operation. By eliminating superfluous content information from the speaker representation, our negation scheme not only mitigates content leakage, thereby enhancing synthesis robustness, but also improves speaker fidelity. In addition, to facilitate the learning of diverse speaker attributes, we leverage multi-stream Transformers, which retain multiple hypotheses and instigate a training paradigm akin to ensemble learning. To unify these hypotheses and realize the final speaker representation, we employ attention pooling. Finally, in light of the imperative to generate target text utterances in the desired voice, we adopt adaptive layer normalizations to effectively fuse the previously generated speaker representation with the target text representations, as opposed to mere concatenation of the text and audio modalities. Extensive experiments and validations substantiate the efficacy of our proposed approach in preserving and harnessing speaker-specific attributes vis-à-vis alternative baseline models.
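
To make the negation and pooling ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the "complete audio representation" and a content representation are available as equally shaped lists of vectors, takes their element-wise difference as the speaker-related residual, and then merges several candidate vectors with a simple dot-product attention pooling. The function names (negate, attention_pool), the toy numbers, and the fixed query vector are all hypothetical; in a trained model the query would be learned and the candidates would come from the multi-stream Transformers.

```python
import math


def negate(full, content):
    """Element-wise subtraction: what remains of `full` after removing `content`.

    Assumption: both inputs are lists of equally sized feature vectors.
    """
    return [
        [f - c for f, c in zip(full_row, content_row)]
        for full_row, content_row in zip(full, content)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Combine several candidate vectors into one weighted average.

    Weights are the softmax of each vector's dot product with `query`
    (hypothetical stand-in for a learned pooling query).
    """
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


# Toy data (made up for illustration only).
full_audio = [[1.0, 2.0], [3.0, 1.0]]
content = [[0.5, 1.0], [1.0, 1.0]]

residual = negate(full_audio, content)

# A zero query scores every candidate equally, so pooling reduces to a mean.
speaker_vec, weights = attention_pool(residual, query=[0.0, 0.0])
```

With a non-zero query the weights shift toward candidates that align with it, which is the property that lets pooling favor some hypotheses over others.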

Paper Structure (20 sections, 6 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Main architecture: (a) negated feature generation produces CIF embeddings, which are passed to (b) multi-stream Transformers to form multi-perspective speaker representations, which are then (c) injected into the TTS backbone. Target speaker embeddings are fused with the input text representations at the fusion encoder and re-injected into the mel decoder for further speaker preservation.
  • Figure 2: Comparative visualizations of distinct speaker embeddings. From left to right are baselines Mellotron, SALN, and our proposed model.
  • Figure 3: ABX comparisons between generated audio from different models. Best seen in color.
  • Figure 4: Mel-spectrogram comparisons between Mellotron, SALN, and the proposed system (from top to bottom). Regions of notable disparity are highlighted in boxes. The uppermost sub-figure shows an instance of erroneously inserted enunciation. In addition, unlike in the other models, the voiceless alveolar fricative /s/ is correctly generated in the last sub-figure.
  • Figure 5: Long-tail distributions of the audio durations in the training dataset. The average duration for approximately 26,500 audio samples is 5.37 seconds (indicated in dark blue). Minimum and maximum audio lengths are 0 and 33 seconds, respectively.
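
The fusion step named in the Figure 1 caption, which the abstract attributes to adaptive layer normalization rather than concatenation, can be illustrated with a small sketch. This is a generic form of adaptive layer normalization written in plain Python under stated assumptions, not the paper's code: a text hidden vector is normalized, then scaled and shifted by a gain and bias predicted from the speaker vector through two linear maps. All names (adaptive_layer_norm, w_gain, w_bias, and so on) and all numbers are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(vec, weight, bias):
    """Dense layer: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def adaptive_layer_norm(text_hidden, speaker_vec, w_gain, b_gain, w_bias, b_bias):
    """Scale and shift the normalized text vector using speaker-derived values."""
    gain = linear(speaker_vec, w_gain, b_gain)
    shift = linear(speaker_vec, w_bias, b_bias)
    normed = layer_norm(text_hidden)
    return [g * n + s for g, n, s in zip(gain, normed, shift)]


# Toy values (made up): a 3-dim text vector and a 2-dim speaker vector.
text_hidden = [1.0, 2.0, 3.0]
speaker_vec = [1.25, 0.5]

# Hypothetical parameters: gain fixed at 1, shift copies speaker dims into dims 0 and 1.
w_gain = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b_gain = [1.0, 1.0, 1.0]
w_bias = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b_bias = [0.0, 0.0, 0.0]

fused = adaptive_layer_norm(text_hidden, speaker_vec, w_gain, b_gain, w_bias, b_bias)
```

Unlike concatenation, this keeps the text vector's dimensionality unchanged and lets the speaker vector modulate every channel of it.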