Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator

Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka

TL;DR

This paper tackles the data scarcity problem in GAN-based neural vocoders by introducing an augmentation-conditional discriminator (AugCondD), which conditions the discriminator on the augmentation state so that augmented samples do not distort the learning of the original speech distribution. AugCondD is integrated into a GAN-based vocoder framework and uses an augmentation state $\mu$ derived from mixup-based data augmentation. The state is passed to the discriminator by input concatenation, so the discriminator can assess both augmented and non-augmented speech according to its augmentation state. Empirical results on LJSpeech show that AugCondD improves speech quality under limited data while remaining comparable to baselines under full data; its general utility is demonstrated across different network architectures, augmentation methods (e.g., mixup and speaking-rate changes), and speakers. The approach is simple to implement and can be extended to end-to-end models and fine-tuning, offering a practical route to high-quality vocoding with limited data and reduced data-collection costs.

Abstract

A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis because it is fast, lightweight, and produces high-quality speech. However, this data-driven model requires a large amount of training data, which incurs high data-collection costs. This fact motivates us to train a GAN-based vocoder on limited data. A promising solution is to augment the training data to avoid overfitting. However, a standard discriminator is unconditional and insensitive to distributional changes caused by data augmentation. Thus, augmented speech (which can be extraordinary) may be considered real speech. To address this issue, we propose an augmentation-conditional discriminator (AugCondD) that receives the augmentation state as input in addition to speech, thereby assessing the input speech according to the augmentation state, without inhibiting the learning of the original non-augmented distribution. Experimental results indicate that AugCondD improves speech quality under limited data conditions while achieving comparable speech quality under sufficient data conditions. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/augcondd/.

Paper Structure (12 sections, 6 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Comparison of standard discriminator with proposed AugCond$D$. (a) A standard discriminator, unconditional and agnostic to the augmentation state, may consider augmented speech (which can be extraordinary) as the desired real speech. (b) AugCond$D$ receives not only augmented speech but also the augmentation state, allowing it to assess the input speech conditioned on the augmentation state without interfering with the learning of the original non-augmented distribution.
  • Figure 2: Comparison of data-augmentation strategies. "$\mathrm{Ext}$" and "$\mathrm{Aug}$" denote an intermediate representation extractor and augmentation operator, respectively. The red variable and red arrow indicate augmented data and augmented data flow, respectively. Two data-augmentation strategies can be considered for the GAN-based vocoder: (a) augmenting only the training data for $D$; (b) augmenting data for both $G$ and $D$.
  • Figure 3: Comparison of process flows for a GAN with a standard discriminator and GAN with AugCond$D$. (a) Standard discriminator $D(\tilde{x})$ receives augmented speech $\tilde{x}$ only and is agnostic to the augmentation state $\mu$. (b) AugCond$D$ $D(\tilde{x}, \mu)$ accepts $\mu$ in addition to $\tilde{x}$, allowing AugCond$D$ to assess $\tilde{x}$ while considering $\mu$.
  • Figure 4: Process of input concatenation. $A \times B$ indicates a tensor shape with time length $A$ and $B$ channels; $t$ and $d$ denote the time length of augmented speech $\tilde{x}$ and the dimension of augmentation state $\mu$, respectively. After $\mu$ is expanded by a factor of $t$ in the temporal direction, it is concatenated with $\tilde{x}$ in the channel direction. Finally, the concatenated tensor is input into $D$. When $\mu$ is a scalar (as in the experiments), $\mu$ is first expanded to a $1 \times 1$ tensor (i.e., $d = 1$) and then the above process is adopted.
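
The input concatenation described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `concat_augmentation_state` is hypothetical, and plain nested lists stand in for tensors so the example has no dependencies. It assumes the augmented speech $\tilde{x}$ is stored as $t$ frames of channel values and that $\mu$ is either a scalar or a vector of $d$ values.

```python
def concat_augmentation_state(x_aug, mu):
    """Concatenate augmentation state mu to augmented speech along channels.

    Hypothetical helper for illustration only.
    x_aug: list of t frames, each a list of c channel values (shape t x c).
    mu: a scalar, or a list of d values (the augmentation state).
    Returns a nested list of shape t x (c + d).
    """
    # A scalar state is first treated as a 1 x 1 tensor (d = 1).
    if isinstance(mu, (int, float)):
        mu = [float(mu)]
    # Repeat mu for every time step (t x d), then append it to each
    # frame of x_aug, i.e. concatenate in the channel direction.
    return [list(frame) + list(mu) for frame in x_aug]


# Usage: a 4-sample, single-channel waveform (t = 4) with scalar mu = 0.3.
x_aug = [[0.1], [-0.2], [0.05], [0.4]]
d_input = concat_augmentation_state(x_aug, 0.3)
print(len(d_input), len(d_input[0]))  # prints "4 2"
```

In this sketch every frame of the discriminator input carries the same copy of $\mu$, so the discriminator sees the augmentation state at every time step. A tensor library would typically do the same thing with a repeat along the time axis followed by a channel-wise concatenation.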