RAF: Relativistic Adversarial Feedback For Universal Speech Synthesis

Yongjoon Lee; Jung-Woo Choi

Abstract

We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote generalizable representations. RAF addresses this problem by leveraging speech self-supervised learning models to assist discriminators in evaluating sample quality, encouraging the generator to learn richer representations. Furthermore, we utilize relativistic pairing for real and fake waveforms to improve the modeling of the training data distribution. Experiments across multiple datasets show consistent gains in both objective and subjective metrics on GAN-based vocoders. Importantly, the RAF-trained BigVGAN-base outperforms the LSGAN-trained BigVGAN in perceptual quality using only 12% of the parameters. Comparative studies further confirm the effectiveness of RAF as a training framework for GAN vocoders.
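To make the idea of relativistic pairing concrete, the following is a minimal sketch of a generic relativistic pairing GAN loss. It is not the RAF objective itself, which according to the abstract also involves feedback from speech self-supervised learning models. The function names, the use of softplus as the loss shape, and the assumption that each real waveform is paired with the fake waveform at the same index are illustrative choices that do not come from the source.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def relativistic_pairing_losses(real_scores, fake_scores):
    """Return (discriminator_loss, generator_loss) averaged over pairs.

    Assumption: real_scores[i] and fake_scores[i] are discriminator
    outputs for the i-th real waveform and its paired fake waveform.
    The loss depends only on the score difference within each pair,
    not on the absolute score of either sample.
    """
    if len(real_scores) != len(fake_scores) or not real_scores:
        raise ValueError("need equally sized, non-empty score lists")
    n = len(real_scores)
    # Discriminator: push each real score above its paired fake score.
    d_loss = sum(softplus(f - r) for r, f in zip(real_scores, fake_scores)) / n
    # Generator: push each fake score above its paired real score.
    g_loss = sum(softplus(r - f) for r, f in zip(real_scores, fake_scores)) / n
    return d_loss, g_loss


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    d_loss, g_loss = relativistic_pairing_losses([2.0, 1.0], [-1.0, 0.5])
    print(f"discriminator loss: {d_loss:.4f}, generator loss: {g_loss:.4f}")
```

In this sketch, when the discriminator cannot tell a pair apart (equal scores), both losses equal log 2, and the generator loss only falls when fake samples score higher relative to their real counterparts.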

Paper Structure (37 sections, 19 equations, 3 figures, 6 tables, 1 algorithm)


Introduction
Related works and preliminary
SSL models in speech generative models
Training strategies for improving GAN vocoders
Relativistic Pairing GAN
Relativistic Adversarial Feedback
Quality gap
Discriminator gap
Adversarial training objective
Zero-centered gradient penalty
Mel reconstruction and feature matching loss
Final loss
Effect of segment size on quality estimation
Experiments
Baseline methods
...and 22 more sections

Figures (3)

Figure 1: Overall framework of the RAF. (a) represents the process of deriving the non-separable discriminator gap and quality gap. (b) represents the training objectives for the generator and the discriminator. $\phi$ denotes the mel transformation. $i$ denotes the data index.
Figure 2: An illustration of the effect of segment size on quality estimation error.
Figure 3: Toy experiments following Sun et al. (2020). (a), (b), and (c) represent the training process of MetricGAN-RAF-v1, MetricGAN-RAF-v2, and RAF for two-cluster data, respectively. True data are red, and fake data are blue. RAF escapes mode collapse more quickly than the other training objectives.
